Abstract


Given a parallel machine with processors arranged in some particular network
topology (e.g., on a mesh machine the processors are arranged in a rectangular
grid), we have to execute different parallel jobs. Each job requires some
part of the machine (e.g., a mesh of a smaller size), and can be executed on
any subset of processors with that network topology. Each job will run for
some fixed time, regardless of when we execute it. But we do not know the
running times in advance; the only way to determine the running time of
a job is to execute it. Scheduling may also be constrained by dependencies
between jobs; it may be the case that a job cannot be started until some
other jobs have finished. Our task is to schedule a given set of jobs so that
all constraints are satisfied and the total time is as small as possible.

We  claim  that  in  this  model  efficient  on-line  scheduling  is  possible  on 
a  variety  of  different  parallel  machines,  including  PRAMs,  hypercubes  and 
mesh  machines.  However,  the  efficiency  depends  on  various  factors,  including 
the  presence  of  dependencies,  the  combinatorial  complexity  of  the  network 
topology,  randomization,  the  use  of  virtualization,  and  the  maximal  size  of 
jobs. 

We  show  that  without  dependencies,  randomized  algorithms  can  achieve 
a  significantly  better  performance  than  deterministic  ones;  on  the  other  hand 
with  dependencies  randomization  does  not  help. 

The  complexity  of  the  network  topology  has  a  big  influence  both  with  and 
without  dependencies.  For  more  complex  networks  the  optimal  performance 
is  significantly  worse.  Without  dependencies,  it  is  to  some  extent  possible 
to  avoid  this  loss  of  performance  by  using  more  sophisticated  algorithms  for 
more  complex  networks;  we  show  that  the  greedy  method,  which  is  a  natural 
method  for  on-line  scheduling,  works  very  well  for  simple  cases  but  it  is  not 
efficient  for  more  complex  machines. 

With  dependencies,  we  show  that  to  achieve  a  good  overall  performance, 
it  is  sometimes  essential  to  use  virtualization,  i.e.,  to  schedule  some  jobs  on 
a  smaller  number  of  processors  than  they  request,  even  though  it  means  that 
their running times increase proportionally. Also, limiting the maximal size
of jobs improves the performance with dependencies. On the other hand,
without dependencies virtualization and the maximal size of jobs are not
important factors.

We  also  study  another  model  in  which  we  only  have  sequential  jobs  (jobs 
requiring  only  one  processor).  As  opposed  to  the  other  model,  they  arrive 
and  have  to  be  scheduled  one  by  one  in  a  predetermined  order.  The  running 
time  of  a  job  is  known  as  soon  as  it  arrives,  but  we  have  to  schedule  the 
job  immediately  without  any  knowledge  of  future  jobs.  The  goal  is  again 
to  schedule  all  jobs  so  that  the  total  time  is  as  small  as  possible.  In  this 
model,  we  significantly  improve  the  previously  known  lower  bounds  on  the 
performance  of  randomized  algorithms. 
1 Introduction


This  thesis  is  a  theoretical  study  of  scheduling  jobs  on  parallel  machines  when 
only partial information about them is available in advance. We study how
the length of the schedule changes under the influence of various factors, such
as  the  use  of  randomization,  the  presence  of  dependencies  between  jobs,  the 
choice  of  network  topology,  the  use  of  virtualization,  and  the  maximal  size 
of  jobs.  We  prove  tight  or  very  close  upper  and  lower  bounds  on  the  best 
possible  performance  of  on-line  scheduling  algorithms  for  many  combinations 
of  these  factors,  thus  giving  an  essentially  complete  picture  of  how  they 
interact and together influence the performance and methods for scheduling.

Our  on-line  scenario  reflects  the  real  life  situation  where  we  rarely  have  full 
information  about  the  jobs  that  we  wish  to  schedule;  similarly,  the  factors 
and  network  topologies  we  study  are  derived  from  practical  concerns.  We 
believe that with the development of larger parallel machines and the increase in
the amount and variety of applications performed on them, the results of this
thesis  will  become  relevant  for  the  choice  of  the  scheduling  strategies  used  in 
practice. 

1.1  On-line  scheduling  of  parallel  jobs 

Imagine  that  we  have  a  massively  parallel  machine  with  processors  arranged 
in  a  rectangular  grid.  This  machine  is  used  by  a  number  of  users  who  submit 
different parallel jobs to be executed on the parallel machine. Each job
will run for some fixed time, called its running time, regardless of when we
execute it. But we do not know the running times in advance; the only way
to determine the running time of a job is to execute it. This is what we call
an on-line problem. Our task is to schedule a given set of jobs so that the
total time is as small as possible.

Some  parallel  jobs  may  make  use  of  all  available  processors.  Other  jobs 
may  be  able  to  use  only  a  limited  number  of  processors  efficiently:  they 
request  a  rectangular  grid  of  some  smaller  size,  and  can  be  executed  on 
any  subset  of  processors  that  form  such  a  grid.  We  assume  that  the  grid 
structure  has  to  be  preserved  because  efficient  parallel  programs  are  written 
for a particular network topology and in general it is impossible to execute
them efficiently on a set of processors with a different network topology. We
can schedule more than one job at a time, as long as we can satisfy the requirements
of all scheduled jobs simultaneously.

Scheduling  may  also  be  constrained  by  dependencies  between  different 
jobs. This means that it may be the case that a job cannot be started before
some  other  jobs  are  finished.  In  fact,  in  the  on-line  problem  we  might  not 
even  know  about  the  existence  of  a  job  until  all  jobs  on  which  it  depends  are 
finished.  Our  schedule  has  to  respect  these  constraints. 

Once a job is started, it has to be completed on the same processors
without stopping; it cannot be moved to other processors or stopped and
restarted later on the same or different processors.

This  problem  can  be  viewed  in  a  very  geometric  way.  We  have  an  empty 
square  representing  the  machine  with  processors  arranged  in  a  square  grid, 
and  a  set  of  rectangles  that  represent  the  jobs  that  have  to  be  scheduled. 
At  the  beginning  we  want  to  tile  the  square  by  the  rectangles  so  that  a 
large fraction of the space is used; already this is a nontrivial problem. But
in the on-line setting, and especially in the presence of dependencies, the
situation gets much worse. After some jobs finish, the available processors
may  occupy  a  region  that  is  much  more  complex  and  difficult  to  tile  than 
the  original  square;  in  fact  it  might  be  impossible  to  tile  it  by  the  remaining 
rectangles.  It  is  this  interaction  of  the  geometric  structure  of  the  machine 
with  the  on-line  situation  that  makes  the  problem  difficult. 
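
The geometric difficulty described above can be illustrated with a small sketch. It assumes a hypothetical occupancy grid `busy` and a simple first-fit search (`find_free_submesh`), neither of which comes from the thesis, and it does not rotate jobs. The example shows a 4 x 4 machine in which half of the processors are idle, yet no free region can hold a job requesting a 2 x 4 or a 4 x 2 grid.

```python
from typing import List, Optional, Tuple


def find_free_submesh(busy: List[List[bool]], a: int, b: int) -> Optional[Tuple[int, int]]:
    """Return the top-left corner (row, col) of the first free block with
    a rows and b columns, or None if no such block exists.

    busy[r][c] is True when processor (r, c) is running a job.
    """
    rows, cols = len(busy), len(busy[0])
    for r in range(rows - a + 1):
        for c in range(cols - b + 1):
            if all(not busy[r + i][c + j] for i in range(a) for j in range(b)):
                return (r, c)
    return None


def occupy(busy: List[List[bool]], corner: Tuple[int, int], a: int, b: int) -> None:
    """Mark the a-by-b block whose top-left corner is `corner` as busy."""
    r, c = corner
    for i in range(a):
        for j in range(b):
            busy[r + i][c + j] = True


# A 4 x 4 machine with two 2 x 2 jobs still running in opposite corners.
busy = [[False] * 4 for _ in range(4)]
occupy(busy, (0, 0), 2, 2)
occupy(busy, (2, 2), 2, 2)

free = sum(not cell for row in busy for cell in row)  # number of idle processors
fits_2x4 = find_free_submesh(busy, 2, 4)              # 2 rows by 4 columns
fits_2x2 = find_free_submesh(busy, 2, 2)
```

Here `free` counts eight idle processors, `fits_2x2` finds a corner for a 2 x 2 job, and `fits_2x4` is `None`: the idle region is large enough by area but has the wrong shape.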

We study this problem not only for two-dimensional meshes as introduced
above, but also for a variety of other network topologies: PRAMs,
hypercubes, and one- and higher-dimensional meshes. We claim that efficient
on-line  scheduling  is  possible  for  all  these  topologies,  both  with  and  without 
dependencies.  However,  the  performance  depends  significantly  on  various 
factors. 

Not  surprisingly,  the  presence  of  dependencies  has  a  major  influence.  Not 
only  is  the  performance  without  dependencies  significantly  better,  but  also 
the  influence  of  other  factors,  discussed  below,  changes  dramatically. 

The  network  topology  of  the  machine  is  very  important  both  with  and 
without dependencies. For more complex networks it is necessary to use more
sophisticated  algorithms  and  the  optimal  performance  is  worse,  especially  in 
the  presence  of  dependencies.  We  show  that  the  greedy  method,  which  is  a 
natural  method  for  on-line  scheduling,  works  very  well  for  simple  machines 
but  is  not  efficient  for  more  complex  ones. 

Another  important  factor  is  whether  the  algorithm  is  allowed  to  make 
use  of  randomness.  Without  dependencies,  we  give  a  randomized  algorithm 
for meshes which has much better performance than an optimal deterministic
one. We know of only one other result that shows for some model of
on-line  scheduling  that  randomization  provably  improves  the  performance — 
namely  the  result  of  [BFKV92]  on  randomized  scheduling  of  sequential  jobs 
on  two  processors,  which  we  discuss  in  Section  17.  However,  in  that  result 
the  improvement  is  only  by  a  small  constant  factor,  whereas  in  our  case  the 
improvement  is  much  larger;  also  our  technical  tools  are  different  and  much 
more  involved. 

Surprisingly, in contrast to the randomized algorithm above, we demonstrate
that with dependencies randomization cannot improve the performance
significantly.

We show that to achieve overall efficiency of scheduling with dependencies,
it is essential to schedule some jobs on a smaller number of processors
than they request, even though it means that their running times increase
proportionally. This technique of running a job on a smaller number of processors
than it could use is called virtualization and it is a standard tool in
the design of parallel algorithms. Our work shows its importance in another
area  by  demonstrating  that  it  is  a  necessary  and  useful  technique  for  efficient 
on-line  scheduling  of  parallel  jobs.  If  we  bound  the  maximal  size  of  jobs  (the 
requested  number  of  processors),  it  improves  the  optimal  performance  in  the 
presence  of  dependencies.  Without  dependencies,  neither  bounding  the  size 
of  jobs  nor  using  virtualization  has  a  significant  influence. 

We are interested in theoretical aspects of this problem, abstracting from
some  questions  that  may  be  important  for  practical  systems.  However,  the 
additional  costs  that  we  neglect  are  small,  if  the  jobs  do  not  have  extremely 
short  running  times,  which  is  usually  the  case.  We  assume  that  the  startup 
costs are included in the running time of a job, which means that we neglect
the fact that these costs may depend on the particular placement of the
jobs. We also do not consider the costs of communication between dependent
jobs, and between the parallel machine and the users. We do not consider
the  overhead  of  the  scheduling  algorithm;  however,  all  our  algorithms  are 
simple  and  easy  to  implement  even  in  a  distributed  environment,  hence  this 
overhead  is  very  small. 


1.2  Randomized  on-line  scheduling  of  sequential  jobs 

For  sequential  jobs  (jobs  that  require  only  one  processor)  we  consider  another 
model,  which  was  studied  before  in  [Gra66,  GW93,  BFKV92,  KPT94]. 

This model is essentially a modified version of the game of Tetris. We
have some fixed number of columns. Rectangles arrive one by one; each of
them  is  one  column  wide  and  extends  over  one  or  more  rows.  We  have  to 
put  each  rectangle  in  one  of  the  columns.  The  goal  is  to  minimize  the  total 
number  of  rows  that  are  at  least  partially  used  by  the  rectangles. 

In  this  scenario  the  columns  represent  the  processors,  rows  represent  the 
time  steps  and  the  rectangles  represent  the  jobs.  The  jobs  are  sequential, 
which  is  represented  by  the  requirement  that  every  rectangle  is  only  one 
column  wide.  The  running  time  is  represented  by  the  height  of  a  rectangle. 
The  running  time  is  known  when  a  job  arrives,  unlike  in  our  model  for  parallel 
jobs. The jobs arrive one by one; as soon as a job arrives, the scheduler has
to  assign  it  to  one  of  the  processors. 
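
The column picture can be made concrete with a minimal sketch. It assumes a greedy rule that always places the arriving rectangle in the currently lowest column; this rule and the name `greedy_assign` are illustrative assumptions, not the algorithm studied in Part IV.

```python
from typing import List


def greedy_assign(heights: List[int], columns: int) -> List[int]:
    """Place each arriving rectangle in the currently lowest column
    (ties go to the leftmost column) and return the final column loads."""
    load = [0] * columns
    for h in heights:
        target = min(range(columns), key=lambda c: load[c])
        load[target] += h
    return load


# Three rectangles of heights 1, 1 and 2 arrive in this order on 2 columns.
load = greedy_assign([1, 1, 2], columns=2)
rows_used = max(load)
```

In this run the greedy rule uses three rows, while placing the two unit rectangles in one column and the tall one in the other would use only two. The gap arises because each decision is made without knowledge of future rectangles.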

In  this  model,  the  problem  is  completely  solved  only  for  two  processors,  in 
which case an optimal randomized algorithm is known and it is better than
the  optimal  deterministic  algorithm.  For  more  processors,  the  best  known 
algorithm  is  deterministic,  which  means  that  we  do  not  know  how  to  make 
use  of  randomization  at  all. 

We  prove  a  lower  bound  on  the  performance  of  randomized  algorithms 
in  this  model,  which  improves  significantly  on  the  previously  known  bounds 
for  more  than  two  processors.  It  also  gives  significant  insight  about  how  a 
randomized algorithm matching this bound has to work. However, we are
not  able  to  give  such  an  algorithm  at  this  time. 
1.3 Outline

In  the  first  three  parts  of  the  thesis  we  study  scheduling  of  parallel  jobs.  In 
Part  I  we  introduce  our  model  for  scheduling  of  parallel  jobs,  summarize  our 
results  and  present  some  basic  techniques.  In  Part  II  we  prove  the  results 
about  scheduling  without  dependencies,  and  in  Part  III  we  prove  the  results 
about  scheduling  with  dependencies. 

The  result  on  scheduling  of  sequential  jobs  is  presented  in  Part  IV,  which 
includes  the  definitions  and  overview  of  the  results  and  previous  work.  This 
part  is  completely  self-contained  and  independent  of  previous  parts. 
Part I

A  model  and  results  for 
scheduling  parallel  jobs 

We first define our model and discuss the practical motivation in Sections 2
and  3.  Then  in  Sections  4  and  5  we  present  and  discuss  our  results.  We 
discuss  some  possible  alternatives  to  our  model  and  related  work  in  Section  6. 
Finally,  in  Section  7  we  introduce  some  basic  technical  tools  and  notation 
used  in  Parts  II  and  III. 


2  The  model 

2.1  Parallel  machines  and  network  topologies 

For  our  purpose,  a  parallel  machine  with  a  specific  network  topology  is  an 
undirected graph where the nodes represent processors and the edges represent
the communication links between the processors. A set of job types is a
collection  of  subgraphs  of  the  machine  that  can  be  requested  by  a  job.  Each 
job  requests  a  particular  job  type,  which  means  that  it  can  be  scheduled  on 
any  subgraph  of  the  machine  isomorphic  to  its  job  type.  We  assume  that  on 
any  machine  a  job  may  request  the  whole  machine  (if  it  uses  full  parallelism) 
or a single processor (if it is an inherently sequential job). In our representation
it means that the set of job types always contains at least the graph
representing  the  whole  machine  and  the  graph  with  a  single  node. 

Let  us  define  the  network  topologies  that  we  consider  in  this  thesis.  The 
simplest  theoretical  model  of  a  parallel  machine  is  a  PRAM,  which  supports 
direct  communication  between  any  two  processors.  We  represent  a  PRAM 
machine  with  N  processors  as  a  complete  graph  with  N  nodes,  which  reflects 
the  fact  that  direct  communication  between  any  two  processors  is  supported. 
Available job types are all complete graphs with up to N nodes. This represents
the fact that a job requesting p processors can be executed on any
subset of p processors.

Machines whose underlying topology is a d-dimensional mesh are represented
by grid graphs of dimension d. Job types are all grids with dimensions
smaller than or equal to the dimensions of the machine. The one-dimensional
mesh machine consists of N processors arranged on a line, each of them
connected to its neighbors. Any job must be executed on a connected segment
of the machine. Thus the job types are all contiguous segments with
up to N processors. A two-dimensional $n_1 \times n_2$ mesh machine is represented
by a rectangular grid with width $n_1$ and height $n_2$. The set of job types
is the set of all smaller rectangular grids. In contrast to PRAMs and
one-dimensional meshes, in the case of two-dimensional meshes the job types are
not determined by the number of processors, since we distinguish between a
$10 \times 10$ rectangle and a $5 \times 20$ rectangle.

A d-dimensional hypercube has $N = 2^d$ processors indexed by vectors of
d bits. Two processors are connected if their indices differ in exactly one bit.
Available job types are all d'-dimensional subcubes for $d' \le d$.
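
As an illustration of the hypercube job types, the following sketch enumerates the placements available to a job requesting a k-dimensional subcube. It assumes the usual convention that a subcube is obtained by fixing d - k index bits and letting the remaining k bits vary; the function name `subcubes` and the encoding of processors as integers are assumptions of this example.

```python
from itertools import combinations, product
from typing import FrozenSet, Iterator


def subcubes(d: int, k: int) -> Iterator[FrozenSet[int]]:
    """Yield every k-dimensional subcube of the d-dimensional hypercube
    as a set of processor indices (integers with d bits)."""
    for free_bits in combinations(range(d), k):
        fixed_bits = [b for b in range(d) if b not in free_bits]
        # Each assignment of values to the fixed bits selects one subcube.
        for fixed_vals in product((0, 1), repeat=len(fixed_bits)):
            base = sum(v << b for v, b in zip(fixed_vals, fixed_bits))
            yield frozenset(
                base | sum(v << b for v, b in zip(free_vals, free_bits))
                for free_vals in product((0, 1), repeat=k)
            )


# All placements of a 2-dimensional subcube in a 3-dimensional hypercube.
cubes = list(subcubes(3, 2))
```

For d = 3 and k = 2 this yields six placements of four processors each, one for every face of the cube.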

2.2  Parallel  jobs  and  virtualization 

A  parallel  job  is  characterized  by  the  job  type  G  and  the  running  time  t  on 
a  set  of  processors  of  that  job  type.  We  assume  that  all  processors  run  at 
the  same  speed,  thus  if  a  job  is  executed  on  two  isomorphic  subgraphs,  the 
running  times  are  equal.  The  work  of  a  job  is  t|G|,  where  |G|  is  the  size  of 
the job, i.e., the number of processors requested by the job. A sequential job
is  a  job  requesting  one  processor. 

An  important  fact  is  that  any  parallel  job  can  be  scheduled  on  fewer 
processors  than  it  requests.  In  the  extreme  case  we  can  run  it  on  a  single 
processor.  Then  all  the  work  is  done  by  that  processor,  which  increases  the 
running  time  to  the  product  of  the  requested  number  of  processors  and  the 
original  running  time  of  the  job.  In  general,  a  job  is  executed  on  a  smaller 
set  of  processors,  each  of  them  simulating  several  processors  requested  by  the 
job,  and  the  running  time  is  proportionally  larger.  This  technique  is  called 
virtualization and can be implemented by the operating system without any
knowledge of the algorithm executed by the parallel job [HB84, Sar89, Ble90,
SOG+94]. Virtualization yields good results if the mapping of the requested
processors  on  the  smaller  set  preserves  the  network  topology.  This  can  be 
achieved  for  machines  with  regular  topology,  which  includes  all  machines 
considered  in  this  work,  PRAMs,  meshes  and  hypercubes. 

A job requesting a machine G with running time t can run on a machine G'
in time $\alpha(G, G')\,t$, where $\alpha(G, G')$ is the simulation factor [BCH+88, KA86].
Neither the running time nor the work can be decreased by virtualization, i.e.,
$\alpha(G, G') \ge \max(1, |G|/|G'|)$. If the network topology of G can be efficiently
mapped on G', the work does not increase. We assume that the simulations
always preserve the work. This is a reasonable assumption, since
efficient mappings exist for all network topologies we consider.

Under this assumption, a job which requests p processors on a PRAM
(resp. one-dimensional mesh) can be simulated efficiently on a PRAM (resp.
one-dimensional mesh) of p' processors for $p' \le p$, and the running time
increases by a simulation factor of $p/p'$. A job requesting an $a \times b$ mesh
can be simulated on an $a' \times b'$ mesh for $a' \le a$ and $b' \le b$, and the running
time increases by a simulation factor of $ab/(a'b')$. A job requesting a
d-dimensional hypercube can be run on a d'-dimensional hypercube for $d' \le d$,
and the running time increases by a simulation factor of $2^{d-d'}$.
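
The simulation factors for the individual topologies can be computed uniformly as the ratio of the two machine sizes. The sketch below assumes work-preserving simulation; the helper names `simulation_factor` and `hypercube_factor` are hypothetical. A PRAM or a one-dimensional mesh is passed as a one-element tuple of side lengths.

```python
from fractions import Fraction
from math import prod
from typing import Sequence


def simulation_factor(requested: Sequence[int], granted: Sequence[int]) -> Fraction:
    """Work-preserving simulation factor |G| / |G'| for two meshes
    described by their side lengths (same number of dimensions)."""
    if len(requested) != len(granted):
        raise ValueError("meshes must have the same dimension")
    if any(g < 1 or g > r for r, g in zip(requested, granted)):
        raise ValueError("granted mesh must fit inside the requested one")
    return Fraction(prod(requested), prod(granted))


def hypercube_factor(d: int, d_small: int) -> int:
    """Simulation factor for running a d-dimensional hypercube job
    on a d_small-dimensional hypercube."""
    if not 0 <= d_small <= d:
        raise ValueError("need 0 <= d_small <= d")
    return 2 ** (d - d_small)


pram = simulation_factor((8,), (2,))        # 8 processors simulated on 2
mesh = simulation_factor((4, 6), (2, 3))    # 4 x 6 mesh simulated on 2 x 3
cube = hypercube_factor(5, 3)               # 32 processors simulated on 8

# Work is preserved: (running time * factor) * |G'| equals running time * |G|.
t = 3
work_before = t * 4 * 6
work_after = t * mesh * 2 * 3
```

In all three calls the factor is 4, and `work_after` equals `work_before`, which is exactly the work-preservation assumption.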

2.3  Parallel  job  systems  and  schedules 

A  parallel  job  system  (without  dependencies)  is  a  collection  of  parallel  jobs. 
A parallel job system with dependencies is a collection of jobs with the dependencies
given by a directed acyclic graph called the dependency graph. The
nodes of the graph correspond to the jobs and edges correspond to dependencies
between the jobs. A job can be scheduled only when all its predecessors
in  the  dependency  graph  are  finished. 

A  schedule  is  an  assignment  of  a  set  of  processors  and  a  time  interval  to 
each  job  such  that  all  requirements  given  by  the  job  types,  running  times  and 
dependencies  are  satisfied.  That  means  that  (1)  the  processors  assigned  to  a 
job correspond to its job type or can simulate it using virtualization, (2) the
length  of  the  time  interval  assigned  to  a  job  is  its  running  time  multiplied  by 
the  simulation  factor,  if  virtualization  is  used,  and  (3)  if  there  is  a  dependency 
between two jobs, then the time interval assigned to the first job ends before
the  interval  assigned  to  the  second  job  starts.  A  processor  can  be  assigned 
to  at  most  one  job  at  any  time. 
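
The conditions in this definition can be checked mechanically. The following is a minimal sketch for the PRAM case only, where any set of processors is a legal placement. The class names `Job` and `Slot`, the function `is_valid_pram_schedule`, and the choice to let a dependent job start at the exact moment its predecessor ends are assumptions of this example.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Job:
    size: int        # requested number of processors
    time: Fraction   # running time on `size` processors


@dataclass(frozen=True)
class Slot:
    procs: FrozenSet[int]   # processors assigned to the job
    start: Fraction
    end: Fraction


def is_valid_pram_schedule(jobs: List[Job], deps: List[Tuple[int, int]],
                           slots: List[Slot], n_procs: int) -> bool:
    """Check conditions (1)-(3) and processor exclusivity on a PRAM.
    deps contains pairs (u, v) meaning job v depends on job u."""
    if len(jobs) != len(slots):
        return False
    for job, slot in zip(jobs, slots):
        k = len(slot.procs)
        # (1) at most the requested number of processors, all on the machine
        if not 1 <= k <= job.size or any(not 0 <= p < n_procs for p in slot.procs):
            return False
        # (2) interval length = running time * simulation factor size / k
        if slot.end - slot.start != job.time * Fraction(job.size, k):
            return False
    # a processor runs at most one job at any time
    for i in range(len(slots)):
        for j in range(i + 1, len(slots)):
            a, b = slots[i], slots[j]
            if a.procs & b.procs and a.start < b.end and b.start < a.end:
                return False
    # (3) every predecessor ends before its successor starts
    return all(slots[u].end <= slots[v].start for u, v in deps)


# Job 0 requests 2 processors for time 3 but is virtualized onto 1 processor.
jobs = [Job(2, Fraction(3)), Job(1, Fraction(2))]
slots = [Slot(frozenset({0}), Fraction(0), Fraction(6)),
         Slot(frozenset({1}), Fraction(0), Fraction(2))]
ok_independent = is_valid_pram_schedule(jobs, [], slots, n_procs=2)
ok_dependent = is_valid_pram_schedule(jobs, [(0, 1)], slots, n_procs=2)
```

The same assignment is a valid schedule when the two jobs are independent and an invalid one when job 1 depends on job 0, because job 1 starts before job 0 ends.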

Once  a  job  is  started  on  some  processor  or  a  set  of  processors,  it  has  to 
run  on  them  until  completion.  We  do  not  allow  a  job  to  be  preempted,  i.e., 
to  move  it  to  a  different  set  of  processors  where  it  would  continue  to  run. 
We also do not allow a job to be stopped and restarted later on the same or
different  processors. 

2.4 The scheduling problem and scheduling algorithms

A  scheduling  problem  specifies  a  network  topology  together  with  a  set  of 
available  job  types  and  simulation  factors,  whether  dependencies  are  allowed 
and whether virtualization can be used. As a network topology usually determines
the job types and simulation factors in a natural way, we omit this
part  of  the  specification  most  of  the  time.  An  instance  of  the  scheduling 
problem  is  a  parallel  job  system  (with  dependencies,  if  they  are  allowed)  and 
a  machine  with  the  given  topology.  The  output  is  a  schedule  for  the  given 
job  system  on  the  given  machine. 

A scheduling algorithm (for a given scheduling problem) produces a schedule
for any instance of the problem. A scheduling algorithm is off-line if it
receives the complete information as its input, i.e., all jobs, their dependencies
and running times. In the on-line problem, the running times are not
given  as  a  part  of  the  input,  but  can  only  be  determined  by  executing  the  jobs. 
For  on-line  algorithms,  we  distinguish  two  notions,  depending  on  whether  the 
jobs  and  their  resource  requirements  are  given  in  advance  or  only  when  the 
jobs  become  available.  We  say  that  an  algorithm  is  on-line  if  the  running 
times are only determined by scheduling the jobs and completing them, but
the dependency graph and the resource requirements may be known in advance.
An algorithm is fully on-line if it is on-line and at any given moment
it  only  knows  the  resource  requirements  of  the  jobs  currently  available  but 
has  no  information  about  the  future  jobs  and  their  dependencies. 
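
To make the information available to an on-line algorithm concrete, here is a minimal sketch of a greedy scheduler for a PRAM without dependencies. It is an illustration under stated assumptions rather than one of the algorithms of this thesis: the function name `greedy_online_pram` and the first-fit scan in submission order are hypothetical choices. The scheduler decides using only the requested sizes; the hidden running times are consulted solely to generate the completion event of a job after it has been started.

```python
import heapq
from typing import List, Tuple


def greedy_online_pram(n_procs: int, sizes: List[int], hidden_times: List[float]) -> float:
    """Greedy on-line scheduling on a PRAM without dependencies.
    Returns the length of the resulting schedule."""
    if any(s < 1 or s > n_procs for s in sizes):
        raise ValueError("every job must request between 1 and n_procs processors")
    waiting = list(range(len(sizes)))
    running: List[Tuple[float, int]] = []   # heap of (finish time, job index)
    free, now = n_procs, 0.0
    while waiting or running:
        # Start every waiting job that fits, scanning in submission order.
        still_waiting = []
        for j in waiting:
            if sizes[j] <= free:
                free -= sizes[j]
                heapq.heappush(running, (now + hidden_times[j], j))
            else:
                still_waiting.append(j)
        waiting = still_waiting
        # Only now does the scheduler learn a running time: a job completes.
        now, done = heapq.heappop(running)
        free += sizes[done]
    return now


# Four jobs on a 4-processor PRAM; the running times are unknown to the scheduler.
makespan = greedy_online_pram(4, sizes=[2, 2, 4, 1], hidden_times=[3, 1, 2, 1])
```

On a PRAM only the number of free processors matters, so the sketch tracks a counter instead of concrete processor sets. In the example the job requesting the whole machine has to wait until time 3, and the schedule has length 5.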
2.5 The performance measure

We  measure  the  performance  by  the  length  of  a  schedule,  which  is  the  time 
when  the  last  job  finishes  (the  makespan).  An  optimal  schedule  for  a  given 
job  system  is  a  schedule  with  the  minimal  length.  An  optimal  schedule  can 
be computed off-line, with knowledge of all running times and dependencies,
and  with  unlimited  computational  power. 

An  on-line  algorithm  is  evaluated  by  the  competitive  ratio,  introduced  by 
Sleator  and  Tarjan  in  [ST85],  which  compares  the  performance  achieved  by 
the  on-line  algorithm  to  the  optimal  schedule.  In  the  case  of  a  scheduling 
algorithm,  the  competitive  ratio  for  a  given  input  is  the  ratio  of  the  length 
of a schedule produced by the on-line algorithm to the length of an optimal
schedule. The competitive ratio of an on-line algorithm is the maximal
competitive ratio over all inputs. Equivalently, we say that a scheduling
algorithm is $\sigma$-competitive if for every input it generates a schedule which is
at most $\sigma$ times longer than the optimal schedule. A randomized scheduling
algorithm is $\sigma$-competitive if for every instance the expected length of the
schedule S generated by the algorithm is at most $\sigma$ times the length of the
optimal schedule; the expectation is taken over the random bits of
the scheduling algorithm.
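
As a worked example of the competitive ratio on a single input, the sketch below runs list scheduling on a classical instance of sequential jobs: N(N - 1) jobs of running time 1 followed by one job of running time N. The function names are hypothetical, and the instance is chosen for illustration only; it is not claimed to be the construction used in the cited literature. The optimal length N is argued by hand in the comments, not computed.

```python
import heapq
from fractions import Fraction
from typing import List


def list_schedule_makespan(n_procs: int, times: List[int]) -> int:
    """Start the jobs in the given order, each on the processor that
    becomes free first, and return the length of the schedule."""
    finish = [0] * n_procs          # heap of times at which processors become free
    heapq.heapify(finish)
    for t in times:
        start = heapq.heappop(finish)
        heapq.heappush(finish, start + t)
    return max(finish)


def bad_instance(n_procs: int) -> List[int]:
    """N(N-1) unit jobs followed by a single job of running time N."""
    return [1] * (n_procs * (n_procs - 1)) + [n_procs]


N = 4
online = list_schedule_makespan(N, bad_instance(N))
# Optimal schedule: the long job runs alone on one processor (time N) while the
# N(N-1) unit jobs fill the other N-1 processors (time N). No schedule can be
# shorter than the long job, so the optimum is exactly N.
optimal = N
ratio = Fraction(online, optimal)
```

For N = 4 the on-line schedule has length 7 and the optimum is 4, so the ratio on this input is 7/4, that is, 2 - 1/N.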

3  Practical  motivation 

Our  model  is  motivated  by  scheduling  on  massively  parallel  machines  in  two 
different  scenarios. 

In  the  first  scenario,  the  massively  parallel  machine  is  running  in  a  batch 
mode  for  some  period  of  time.  All  jobs  from  different  users  are  submitted 
in  advance  and  their  degree  of  parallelism  is  known,  but  the  running  times 
might  be  unknown.  In  practice,  supercomputer  centers  typically  run  their 
computers in a similar scenario for at least some part of the time. Typically
this  is  used  for  computationally  intensive  jobs  with  limited  input  and  output, 
for  which  the  costs  that  we  neglect  are  relatively  small.  Thus  this  scenario 
matches  our  model  of  scheduling  without  dependencies  very  well.  (We  discuss 
the variation of this scenario when jobs are not given in advance but arrive
at  some  fixed  times  in  Section  6.7.) 

In  practice,  the  scheduling  problems  are  solved  by  very  simple  heuristics. 
The parallel machine is typically partitioned into a few partitions (e.g., a
CM5  machine  with  512  processors  can  be  divided  into  5  partitions  of  256, 
128,  64,  32  and  32  processors).  Jobs  can  require  only  one  of  these  few  sizes, 
and  a  separate  queue  is  maintained  for  every  partition.  If  there  are  no  jobs 
for  one  partition,  its  processors  are  left  unused.  Jobs  requiring  the  whole 
machine  are  run  in  special  batches.  In  addition,  to  aid  in  scheduling,  users 
are often required to give an estimate of the running time of their job. While
such simple heuristics might be sufficient for the size and load of the parallel
computers today, it is clear that this is not a solution that can satisfy growing
demands  in  the  future. 

In  the  second  scenario,  a  parallel  job  system  with  dependencies  can  be  a 
result  of  a  decomposition  of  a  large  task  into  subtasks  that  might  differ  in 
the  degree  of  parallelism  that  can  be  used  efficiently.  Such  a  decomposition 
can  be  provided  by  a  programmer,  or  it  could  be  obtained  automatically  by 
a  sophisticated  compiler.  Typically  in  such  a  decomposition  the  amount  of 
communication  between  the  separate  subtasks  is  small.  This  matches  our 
model  of  scheduling  with  dependencies. 

Current  research  and  methods  for  design  of  parallel  compilers  are  focused 
on partitioning the processors and assigning them to the subtasks during the
compile time; this assignment is then fixed and is independent of the input
data [Sar89, Sub93, SOG+94]. This approach is suitable if the dependency of
the  running  times  on  the  input  is  small,  since  then  the  necessary  estimates  of 
the  running  times  of  the  subtasks  can  be  based  on  the  previous  performance. 
However,  for  many  applications  the  running  times  depend  significantly  on  the 
input.  To  achieve  a  satisfactory  performance  for  such  tasks,  it  is  necessary 
to  address  the  on-line  scheduling  problem. 


4  Our  results 


For  most  network  topologies  we  prove  tight  or  very  close  lower  and  upper 
bounds  on  the  competitive  ratio,  both  with  and  without  dependencies. 

The  technically  most  difficult  results  are  the  bounds  for  randomized 
scheduling, specifically our randomized constant-competitive algorithm for
meshes without dependencies, and our lower bound for
randomized scheduling on one-dimensional meshes with dependencies. Of the
results for deterministic scheduling, the most difficult are the lower bound
of $\Omega(\sqrt{\log\log N})$ for scheduling on meshes without dependencies and the
tradeoff  for  scheduling  on  PRAMs  with  dependencies  using  virtualization. 

In  all  the  results,  N  denotes  the  number  of  processors  of  the  machine. 


4.1  Scheduling  without  dependencies 

For deterministic on-line scheduling without dependencies we obtain optimal
or almost optimal algorithms for all basic network topologies that we
study: PRAMs, hypercubes, one- and two-dimensional meshes, and
higher-dimensional meshes if the dimension is constant. The bounds on the
competitive ratio for deterministic scheduling are summarized in Table 1. The
lower bound of $2 - 1/N$ uses only sequential jobs and is from [SWW91]; we
present it in Section 7.


Topology         Upper bound                                        Lower bound

two-dim. mesh    $O(\sqrt{\log\log N})$                             $\Omega(\sqrt{\log\log N})$
PRAM             $2 - 1/N$                                          $2 - 1/N$
hypercube        $2 - 1/N$                                          $2 - 1/N$
one-dim. mesh    2.5                                                $2 - 1/N$
d-dim. mesh      $O(d \log d \sqrt{\log\log N} + (2d \log d)^d)$

Table 1: The bounds on the competitive ratio for deterministic on-line
scheduling without dependencies.
If randomization is allowed, we obtain an on-line algorithm for scheduling
on  d-dimensional  meshes  for  which  the  competitive  ratio  depends  only  on 
d.  Moreover,  under  some  restrictions  we  obtain  a  much  stronger  result,  an 
algorithm  with  competitive  ratio  that  does  not  even  depend  on  the  dimension 
of  the  mesh.  The  restrictions  are  that  the  dimensions  of  the  jobs  and  of  the 
mesh  are  powers  of  two,  and  all  the  dimensions  of  each  job  are  strictly  smaller 
than  the  corresponding  dimensions  of  the  mesh  (note  that  this  implies  that 
each  dimension  of  a  job  is  at  most  a  half  of  the  corresponding  dimension 
of  the  mesh,  since  all  dimensions  are  powers  of  two).  See  Table  2  for  the 
summary  of  these  results. 


Topology         Restrictions                                  Upper bound

two-dim. mesh    none                                          $O(1)$
d-dim. mesh      dimensions of the mesh are powers of two;     $O(1)$
                 dimensions of jobs are powers of two and
                 smaller than the dimensions of the mesh
d-dim. mesh      none                                          $O(4^d)$

Table 2: The bounds on the competitive ratio for randomized on-line
scheduling without dependencies.


4.2  Scheduling  with  dependencies 

For  scheduling  with  dependencies  we  study  separately  the  cases  with  and 
without  virtualization,  since  they  are  quite  different. 

For deterministic scheduling we also study the dependence of the competitive
ratio on the size of the largest job. We assume that $N$ is the number
of processors and that no job requests more than $\lambda N$ processors, where
$0 < \lambda < 1$ is a constant. For scheduling on PRAMs we obtain tight tradeoffs
between $\lambda$ and the optimal competitive ratio in both cases, with and without
virtualization.


With  virtualization,  we  obtain  an  optimal  algorithm  for  one-dimensional 
meshes  and  efficient  algorithms  for  hypercubes  and  higher-dimensional 
meshes.  Table  3  summarizes  the  results  on  deterministic  scheduling  with 
virtualization. 


Topology         Restrictions            Upper bound                Lower bound

PRAM             none ($\lambda = 1$)    $2 + \phi$                 $2 + \phi$
PRAM             $0 < \lambda < 1$
one-dim. mesh    none                    $O(\log N/\log\log N)$     $\Omega(\log N/\log\log N)$
d-dim. mesh      none
hypercube        none

Table 3: The bounds on the competitive ratio for deterministic on-line
scheduling with dependencies with virtualization; $\phi = (\sqrt{5} - 1)/2 \approx 0.618$
is the golden ratio.

Without  virtualization,  in  addition  to  the  tradeoff  for  PRAMs  mentioned 
above,  we  prove  that  no  efficient  scheduling  is  possible  if  the  size  of  the  jobs 
is  not  restricted.  See  Table  4  for  the  results. 


Topology     Restrictions            Upper bound    Lower bound

arbitrary    none ($\lambda = 1$)    $N$            $N$
arbitrary    $0 < \lambda < 1$
PRAM         $0 < \lambda < 1$

Table 4: The bounds on the competitive ratio for deterministic on-line
scheduling with dependencies without virtualization.

Even if randomization is allowed, we cannot use it efficiently in the presence
of dependencies. We prove a lower bound showing that no randomized
algorithm for scheduling on one-dimensional meshes with virtualization has
a better competitive ratio than $\Omega(\log N/\log\log N)$. Thus our deterministic
algorithm is within a constant factor of the optimal competitive ratio. We also
prove that even with randomization no efficient scheduling without
virtualization is possible if the size of jobs is not restricted; we get a lower
bound of $N/2$ in this case. See Table 5 for these results.


Topology         Virtualization    Lower bound

one-dim. mesh    allowed           $\Omega(\log N/\log\log N)$
arbitrary        not allowed       $N/2$

Table 5: The bounds on the competitive ratio for randomized on-line
scheduling with dependencies.


4.3  Structure  of  dependencies 

In  addition  to  the  bounds  on  the  competitive  ratio  we  prove  that  for  on-line 
scheduling with dependencies the structure of the dependency graph is not very
important. The following theorem says that any on-line algorithm for
scheduling  job  systems  whose  dependency  graphs  are  trees  can  be  converted 
to  an  algorithm  for  scheduling  general  dependency  graphs  with  the  same 
competitive  ratio. 

Theorem  4.1  Let  an  on-line  scheduling  problem  with  dependencies  be  given 
(i.e.,  a  specific  architecture  and  simulation  factors).  Then  the  optimal  com¬ 
petitive  ratio  for  this  problem  is  equal  to  the  optimal  competitive  ratio  for 
a  restricted  problem  in  which  we  allow  only  job  systems  whose  dependency 
graphs  are  trees  as  inputs. 

This  theorem  is  easy  to  prove  for  fully  on-line  algorithms,  since  then 
the  algorithm  does  not  know  the  dependency  graph  in  advance.  The  same 
theorem  holds  even  for  general  on-line  algorithms,  where  the  algorithm  knows 
the  dependency  graph  and  all  job  types  before  the  scheduling  starts,  but  a 
more sophisticated argument is necessary.

4.4 Technical assumptions

All  our  lower  bound  results  assume  that  the  running  time  of  a  job  may  be 
zero.  This  is  only  a  convenient  technical  assumption  which  can  be  removed 
easily.  As  all  our  proofs  are  constructive,  we  simply  replace  all  zero  times 
by unit times and scale all other running times to be sufficiently large. This
only  decreases  the  lower  bounds  by  arbitrarily  small  additive  constants. 

The lower bounds for PRAMs for scheduling with dependencies give the
best constant competitive ratios that can be achieved for all $N$ if $\lambda$ is fixed.
There is a small additional term that goes to 0 as $N$ grows.

4.5  History  of  the  problem 

The model of deterministic on-line scheduling of parallel jobs without
dependencies was first introduced and studied in a joint paper with Anja
Feldmann and Shang-Hua Teng [FST91]. That paper contains the results of
Theorem 10.1 from Section 10, Sections 11.1 to 11.3 and Section 12.1 of this
thesis. The journal version [FST94] of the paper [FST91] contains all the
results above together with the results of Sections 8, 9 and the rest of
Section 10.

The results on randomized scheduling without dependencies from
Sections 11.4, 12.2 and 12.3 have not been published before. Sections 11.1, 11.2
and 12.1 are substantially revised versions of the material from the
paper [FST91].

Deterministic on-line scheduling with dependencies was introduced and
studied in joint papers with Anja Feldmann, Ming-Yang Kao and Shang-Hua
Teng [FKST92, FKST93]. Those papers contain the results of
Sections 13, 14.1, 15 and 16.

The results on randomized scheduling with dependencies from
Section 14.2 and part (iii) of Theorem 15.1 in Section 15 have not been published.

The claim in [FKST92, FKST93] that the $O(\log N/\log\log N)$-competitive
algorithm for scheduling on hypercubes is optimal is wrong; we have no matching
lower bound at the present time.

5 Discussion of the results

Our results show, not surprisingly, that scheduling with dependencies is
significantly harder than scheduling without dependencies. What is somewhat
surprising is how much harder it is in some cases; for example, without
dependencies we have a 2.5-competitive algorithm for one-dimensional meshes,
while with dependencies no algorithm for one-dimensional meshes can achieve
a better competitive ratio than $\Omega(\log N/\log\log N)$,
even if both randomization and virtualization are allowed.

It is very interesting to compare the influence of various factors on the
performance of scheduling with and without dependencies; see Table 6. We
examine these factors one by one in the rest of this section.


Factor              Influence                Influence
                    without dependencies     with dependencies

randomization       yes                      no
network topology    small                    yes
virtualization      no                       yes
size of the jobs    no                       yes

Table 6: Factors influencing the performance of on-line scheduling
algorithms.

The performance of scheduling with dependencies depends significantly
on virtualization and the maximal size of a job, while for scheduling without
dependencies these factors are not very important. On the other hand,
randomization does not help much for scheduling with dependencies, while
it significantly improves the performance of scheduling without dependencies on
meshes. Network topology has a big influence on the performance of scheduling
with dependencies; without dependencies the changes in performance are
smaller, but for more complex topologies the algorithms are significantly
more complex.

5.1 Randomization


For scheduling with dependencies we can prove that randomization does
not help, and the optimal competitive ratio for the one-dimensional mesh is
still $\Theta(\log N/\log\log N)$, which is achieved by a deterministic algorithm.

On the other hand, without dependencies, we can use randomization to
significantly improve the competitive ratio for scheduling on two- and
higher-dimensional meshes. If the dimension of the mesh is constant, the optimal
competitive ratio for deterministic algorithms is $\Theta(\sqrt{\log\log N})$, while our
randomized algorithm achieves a constant competitive ratio. If the dimensions
of the machine and of the jobs are powers of two, and there are no large jobs,
the competitive ratio does not even depend on the dimension of the mesh.
If there are no restrictions, the competitive ratio is $O(4^d)$, which is still
significantly better than the deterministic algorithm, where the dependency on
$d$ is $O((2d \log d)^d)$. Note that in practice $d$ is very small, typically a
constant, as arbitrarily large meshes can be built without changing $d$. Hence the
competitive ratio is not that large even if the dependency on $d$ is exponential.

To  achieve  such  a  strong  result,  we  estimate  for  each  job  size  the  total 
work  of  all  jobs  of  that  size  based  on  a  small  random  sample.  However,  it  is 
not  clear  how  this  can  lead  to  a  constant  competitive  ratio,  since  the  number 
of  different  job  sizes  depends  on  the  number  of  processors,  and  in  particular  it 
is exponential in d. We use Chernoff-Hoeffding bounds in a powerful way to
prove  that  with  some  constant  probability  all  of  the  many  different  instances 
of  sampling  give  a  good  approximation  of  the  work. 
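
The estimation step described above can be sketched as follows. This is only a minimal illustration and not the algorithm analyzed in this thesis: the function name `estimate_work_by_size`, the `sample_fraction` parameter, and the representation of a job as a (size, running time) pair are assumptions made for the example. For each job size, the sketch draws a random sample, observes the running times of the sampled jobs (in the on-line setting this means actually executing them), and scales the sample mean up to the whole class.

```python
import random
from collections import defaultdict

def estimate_work_by_size(jobs, sample_fraction, rng):
    """Estimate the total work (size * running time) of each size class.

    jobs: list of (size, running_time) pairs (hypothetical representation).
    sample_fraction: fraction of each class to sample (hypothetical knob).
    rng: a random.Random instance, so that runs are reproducible.
    """
    # Group the running times by job size.
    times_by_size = defaultdict(list)
    for size, running_time in jobs:
        times_by_size[size].append(running_time)

    estimates = {}
    for size, times in times_by_size.items():
        # Sample at least one job from every class.
        k = max(1, int(len(times) * sample_fraction))
        sample = rng.sample(times, k)
        mean_time = sum(sample) / k
        # Scale the sample mean to all jobs of this size.
        estimates[size] = size * len(times) * mean_time
    return estimates
```

The sketch makes no attempt to reproduce the Chernoff-Hoeffding analysis; it only shows what is being estimated. When all jobs of a size have the same running time, the estimate is exact regardless of the sample drawn.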

These results are particularly interesting in view of the fact that the
lower  bound  for  deterministic  algorithms  for  scheduling  on  one-dimensional 
meshes  with  dependencies  in  [FKST93]  and  the  lower  bound  for  deterministic 
algorithms  for  scheduling  on  two-dimensional  meshes  without  dependencies 
in  Section  11.3  both  use  a  very  similar  technique.  Yet  the  first  lower  bound 
generalizes  to  randomized  algorithms  and  the  other  one  does  not. 

It  is  also  interesting  that  randomization  is  used  in  our  algorithm  only  to 
randomly  permute  the  jobs  at  the  beginning  for  the  purpose  of  sampling. 
If  we  assume  that  the  usage  pattern  of  a  parallel  machine  does  not  change 
very fast, we could estimate the work of jobs of different sizes based on the
previous usage of the machine, and then schedule them very efficiently even
without randomization. This is actually used in practice, since scheduling is
sometimes done manually based on previous experience and data. Our
result explains to some extent why it might be very useful to consider
estimates based on previous experience. If these estimates are good, they save
us the sampling, which is a relatively expensive part of our algorithm.

5.2  Network  topology  and  greedy  algorithms 

Our  results  show  that  the  complexity  of  the  network  topology  has  a  big 
influence  on  on-line  scheduling. 

On the more complex machines, not only is the optimal performance
lower,  but  also  the  algorithms  are  more  involved.  The  simplest  algorithms 
use  the  greedy  method,  which  means  that  they  schedule  any  available  job 
as  soon  as  its  resource  requirements  can  be  satisfied.  This  is  optimal  for 
PRAMs,  both  with  and  without  dependencies,  but  as  the  network  topology 
gets  more  complex,  the  performance  of  greedy  methods  decreases  because 
they  tend  to  scatter  the  available  processors  and  hence  make  them  unusable 
for  larger  jobs. 
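
To make the greedy method concrete, the following sketch simulates it on a PRAM without dependencies. The function name `greedy_pram_schedule` and the representation of a job as a (processors requested, running time) pair are assumptions of this example. Running times are used only to advance the simulated clock and never to decide which job to start, which mirrors the on-line setting.

```python
import heapq

def greedy_pram_schedule(jobs, n_procs):
    """Simulate greedy scheduling of independent parallel jobs on a PRAM.

    jobs: list of (processors_requested, running_time) pairs.
    Returns (makespan, start_times).
    """
    if any(procs > n_procs for procs, _ in jobs):
        raise ValueError("a job requests more processors than the machine has")

    waiting = list(range(len(jobs)))   # indices of jobs not yet started
    running = []                       # min-heap of (finish_time, job_index)
    free = n_procs
    now = 0.0
    start = [None] * len(jobs)

    while waiting or running:
        # Greedy rule: start every waiting job whose request fits right now.
        still_waiting = []
        for j in waiting:
            procs, running_time = jobs[j]
            if procs <= free:
                free -= procs
                start[j] = now
                heapq.heappush(running, (now + running_time, j))
            else:
                still_waiting.append(j)
        waiting = still_waiting

        # Advance the clock to the next completion and release its processors.
        finish, j = heapq.heappop(running)
        now = finish
        free += jobs[j][0]
    return now, start
```

On a PRAM any set of free processors is usable, so only the count `free` matters. On a mesh or hypercube the same rule would also have to track which processors are free, and that is where the fragmentation described above appears.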

This  is  particularly  clear  without  dependencies.  Already  for  hypercubes 
and  one-dimensional  meshes  we  need  to  modify  the  greedy  approach.  In  our 
algorithms  we  schedule  the  largest  job  whenever  processors  are  available,  as 
in a greedy algorithm, but if it is impossible to schedule the largest job,
all  smaller  jobs  are  postponed  as  well,  even  if  they  could  be  scheduled  im¬ 
mediately.  For  the  two-dimensional  mesh  we  have  to  abandon  the  greedy 
approach  completely.  The  optimal  algorithm  begins  by  using  only  a  small 
fraction  of  the  mesh  without  even  attempting  to  use  the  whole  mesh,  and 
continues with a larger fraction only when the available processors in the
current fraction become unusable. The proof of the lower bound actually shows
that this is not an arbitrary choice: no greedy-like algorithm can achieve
a substantially better competitive ratio than $\Omega(\log\log N)$. This shows that
the general heuristic of using greedy algorithms for scheduling can be wrong
despite the fact that it works well in many previous scheduling algorithms,
see for example [Gra66, LST90, SWW91].
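
The modified greedy rule for hypercubes (start the largest waiting job whenever it fits, and otherwise postpone all smaller jobs as well) can be sketched as below. This is an illustration under simplifying assumptions, not the algorithm of Section 9: processors are numbered 0 to N-1, job sizes are powers of two, and only subcubes that form an aligned block of addresses are considered. The names `find_free_subcube` and `largest_first_hypercube` are hypothetical.

```python
import heapq

def find_free_subcube(busy, size):
    """Return the base address of the first free aligned block, or None."""
    for base in range(0, len(busy), size):
        if not any(busy[base:base + size]):
            return base
    return None

def largest_first_hypercube(jobs, n_procs):
    """jobs: list of (size, running_time); sizes and n_procs are powers of two.

    Returns (makespan, start_times).
    """
    def is_pow2(x):
        return x > 0 and x & (x - 1) == 0

    if not is_pow2(n_procs) or any(not is_pow2(s) or s > n_procs for s, _ in jobs):
        raise ValueError("sizes must be powers of two not exceeding n_procs")

    # Jobs are considered strictly in order of decreasing size.
    order = sorted(range(len(jobs)), key=lambda j: -jobs[j][0])
    busy = [False] * n_procs
    running = []                      # min-heap of (finish_time, job, base)
    start = [None] * len(jobs)
    now, nxt = 0.0, 0

    while nxt < len(order) or running:
        while nxt < len(order):
            j = order[nxt]
            size, running_time = jobs[j]
            base = find_free_subcube(busy, size)
            if base is None:
                break                 # postpone this job and all smaller ones
            busy[base:base + size] = [True] * size
            start[j] = now
            heapq.heappush(running, (now + running_time, j, base))
            nxt += 1
        # Wait for the next completion and free its subcube.
        finish, j, base = heapq.heappop(running)
        now = finish
        size = jobs[j][0]
        busy[base:base + size] = [False] * size
    return now, start
```

Because jobs are placed in decreasing order of size into aligned blocks, free processors are not scattered in a way that blocks a larger job that is still waiting, which is the effect the postponement rule is meant to achieve.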




In the case without dependencies we can still keep the competitive ratio
constant by using more complex algorithms and randomization for all
architectures including the higher-dimensional meshes. However, there is still
some  difference  in  the  performance  since  the  constants  are  higher  for  more 
complex  topologies. 

On  the  other  hand,  with  dependencies  the  performance  decreases  sharply 
with  the  increasing  complexity  of  the  machine.  Intuitively,  the  reason  is  that 
algorithms  for  scheduling  with  dependencies  have  to  be  able  to  deal  with  the 
jobs  that  become  available  later  during  the  schedule,  and  hence  we  do  not 
have  the  option  of  abandoning  the  greedy  approach  completely. 

5.3  Virtualization 

Without dependencies, virtualization does not help us to improve the
performance of our algorithms. In fact, none of our algorithms for scheduling
without dependencies use virtualization, yet they are competitive even against
schedules that use virtualization. All our lower bounds for scheduling without
dependencies are valid even for algorithms that use virtualization.

This  is  no  longer  true  in  the  presence  of  dependencies.  The  main  reason 
for  this  distinction  is  that  without  dependencies  we  can  process  all  large 
jobs  first.  But  with  dependencies,  jobs  requiring  the  whole  machine  can  be 
dependent  on  other  jobs.  When  they  become  available,  other  jobs  may  be 
running  and  we  have  to  wait  until  all  or  most  of  them  finish  to  be  able  to 
satisfy the requirements of the large jobs. This causes an inefficiency which
can  be  avoided  only  by  the  use  of  virtualization. 

This is clearly demonstrated by the results on scheduling with dependencies
when virtualization is prohibited. If we allow jobs requiring the whole
machine, no efficient scheduling is possible on any machine, even using
randomization. If we restrict the number of processors that a job can request
to some constant fraction of the machine, the situation improves somewhat,
but the competitive ratio is still significantly larger than with an analogous
restriction on the size of jobs and virtualization allowed.

All our algorithms use virtualization only for large jobs, or can be modified
to do that. This supports the intuition that the large jobs are the main
problem, which makes scheduling with dependencies impossible without
virtualization.


5.4  The  size  of  the  jobs 

With  virtualization,  efficient  scheduling  is  possible  even  with  no  restriction 
on  the  size  of  jobs.  However,  even  then  the  competitive  ratio  depends  on  the 
maximal  allowed  size  of  jobs. 

Our tight tradeoffs between the maximal size of a job and the optimal
competitive ratio for deterministic scheduling on PRAMs with dependencies,
both with and without virtualization, are illustrated in Figures 1 and 2.

Figure 1: The relation between $\lambda$ and the competitive ratio for PRAMs,
using virtualization.

Figure 2: The relation between $\lambda$ and the competitive ratio for PRAMs,
without using virtualization.


If the size of all jobs is a small fraction of the total number of processors,
the competitive ratio is in both cases close to 2, which is the optimal
competitive ratio even if we allow only sequential jobs and no dependencies.
If the maximal size of jobs increases, the competitive ratio increases
to $2 + \phi \approx 2.618$ if virtualization is allowed, but with no virtualization it is
unbounded.
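
For reference, the arithmetic behind this constant, using the golden ratio $\phi$ as defined in the caption of Table 3:

```latex
\[
  \phi = \frac{\sqrt{5} - 1}{2} \approx \frac{2.236 - 1}{2} = 0.618,
  \qquad
  2 + \phi \approx 2.618 .
\]
```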


6  Previous  models  and  results 

6.1  Off-line  scheduling  and  approximation  algorithms 

First,  let  us  point  out  that  the  exact  solution  of  off-line  scheduling  problems 
is  NP-hard  even  for  very  simple  special  cases  involving  only  sequential  jobs, 
see [GJ79]. For parallel jobs, Błażewicz, Drabowski and Węglarz [BDW86]
proved that if the number of processors is a part of the input, optimal
scheduling is NP-complete. Later Du and Leung [DL81] improved this result and
showed that optimal scheduling is NP-hard for scheduling on two processors
with dependencies and five processors without dependencies.

Since optimal scheduling is hard, there has been significant interest
in  approximation  algorithms.  However,  it  is  NP-complete  even  to  decide  if 
a  set  of  sequential  jobs  with  unit  running  times  and  with  dependencies  has 
a  schedule  of  length  three  on  a  given  number  of  processors  [LRK78].  This 
result  implies  that  it  is  NP-hard  to  approximate  the  length  of  an  optimal 
schedule  for  sequential  jobs  within  a  factor  better  than  4/3.  For  parallel 
jobs, it is easy to see that even deciding if a set of jobs with unit running
times without dependencies has a schedule of length two is as hard as the
NP-complete problem PARTITION, and hence it is NP-hard to approximate the
length of an optimal schedule within a factor better than 3/2.
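
One way to see the connection to PARTITION is sketched below; the construction is our illustration and is not taken from the text. Given integers with sum 2S, create one unit-time job per integer that requests that many processors of a PRAM with S processors. A schedule of length two then exists exactly when the integers split into two halves of sum S. The function names are hypothetical, and the brute-force check assumes jobs start only at times 0 and 1, which loses nothing for unit-time jobs.

```python
from itertools import combinations

def partition_to_scheduling(numbers):
    """Map a PARTITION instance to (processor requests, machine size)."""
    return list(numbers), sum(numbers) // 2

def has_length_two_schedule(requests, n_procs):
    """Brute force: can unit-time jobs be split into two time steps?

    requests[i] is the number of processors requested by job i.
    Exponential time; intended only for tiny examples.
    """
    total = sum(requests)
    indices = range(len(requests))
    for r in range(len(requests) + 1):
        for first_step in combinations(indices, r):
            load = sum(requests[i] for i in first_step)
            # Both time steps must fit on the machine.
            if load <= n_procs and total - load <= n_procs:
                return True
    return False
```

For example, the integers 3, 1, 1, 2, 2, 1 can be split into two halves of sum 5, so the corresponding jobs fit into two steps on 5 processors, whereas 3, 3, 2 cannot be split evenly and the corresponding jobs need three steps on 4 processors. An approximation algorithm with a factor better than 3/2 would have to distinguish length two from length three, which is the source of the hardness claim.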

Approximation  algorithms  are  related  to  this  thesis,  since  every  on-line 
algorithm is also an approximation algorithm, but the converse is not
necessarily true. Our on-line algorithms have similar or even better performance
than the previously known approximation algorithms. Most previous results
on scheduling parallel jobs do not consider the network topology and thus in
our model they are valid only for scheduling on PRAMs.

The results of Graham [Gra66] on list scheduling give an approximation
algorithm with an approximation ratio of $2 - 1/N$ for scheduling of sequential
jobs on N processors without dependencies. This algorithm is in fact on-line
and can be easily generalized for parallel jobs, which gives our result in
Section 8. A similar result appears also in [TWY92].

Wang  and  Cheng  [WC92]  give  an  approximation  algorithm  for  scheduling 
on  PRAMs  with  dependencies  which  achieves  approximation  ratio  of  3.  Their 
algorithm  is  not  on-line.  We  improve  this  result  in  Section  13  by  presenting 
an  on-line  algorithm  which  achieves  a  competitive  ratio  of  approximately 
2.618,  which  is  optimal  for  on-line  algorithms. 

There  have  been  some  results  on  approximation  algorithms  for  scheduling 
of  independent  jobs  on  hypercubes  and  one-dimensional  meshes.  Chen  and 
Lai [CL88] give an algorithm for hypercubes that achieves the approximation
ratio of $2 - 1/N$. Their algorithm is not on-line; however, it is very similar to
our on-line algorithm from Section 9, which achieves the optimal competitive
ratio of $2 - 1/N$.

Scheduling  on  one-dimensional  meshes  without  dependencies  is  equivalent 
to  packing  two-dimensional  rectangles  into  a  two-dimensional  bin  so  that  the 
total  height  is  as  small  as  possible.  For  a  long  time  the  best  known  algorithm 
was  by  Sleator  [Sle80]  which  achieves  an  approximation  ratio  of  2.5.  This 
algorithm  is  not  on-line,  and  our  2.5-competitive  algorithm  in  Section  10 
is  very  different.  Recently  Steinberg  [Ste93]  obtained  an  algorithm  with  an 
approximation  ratio  of  2.  This  algorithm  uses  the  information  about  running 
times  in  a  crucial  way  and  hence  it  is  not  on-line. 

6.2  Computational  complexity  of  on-line  algorithms 

The  competitive  ratio  does  not  measure  the  amount  of  computation  done  by 
the on-line algorithms in any explicit way. However, typically the algorithms
that  are  good  from  the  viewpoint  of  competitive  analysis  are  also  simple  and 
do  not  require  a  large  amount  of  computation. 

This  is  true  in  our  case.  The  most  complex  operation  done  by  our  schedul¬ 
ing  algorithms  is  to  sort  the  jobs  according  to  their  size.  This  can  be  done 
very fast in parallel, since the number of different sizes is at most
proportional to the number of processors. Once the jobs are sorted, they are
either processed in a predetermined order, or the different sizes are treated
independently, and no significant amount of computation is needed.

The  algorithms  can  also  be  executed  in  a  distributed  environment,  since  it 
is  possible  to  implement  them  with  a  very  limited  amount  of  communication. 

6.3  Emphasis  on  network  topology 

The  new  feature  of  our  model  is  that  we  study  on-line  scheduling  on  concrete 
network  topologies.  Only  a  small  fraction  of  previous  work  on  scheduling  of 
parallel jobs takes into account the network topology of the machine [Sle80,
CL88, TWY92]; in all cases the focus is on approximation algorithms for
independent jobs. The rest of previous literature on scheduling is concerned
only  with  the  number  of  processors  required  by  a  job,  not  with  the  constraints 
of  the  concrete  network  topology;  thus  from  our  point  of  view,  these  results 
apply  only  to  the  simplest  network  topology,  PRAMs. 

Various network topologies have been studied extensively, but with a
different emphasis. A typical question in this area is whether it is
possible to simulate one network by another network (either with a different
topology, or with a different size), and how efficient the simulation can
be [BCH+88, KA86, Ran87]. Such results are closely related to the
technique of virtualization. Virtualization is possible only if the simulation of
the given network by a smaller one can be efficient. Therefore the most
important aspect of these results for us is that they justify the technique of
virtualization for simple cases and study its limits for more complex network
topologies.

6.4  Virtualization 

Virtualization is possible on practical systems with a regular topology,
possibly with some additional costs that we neglected. These costs are small
if the parallel job uses parallelism to reduce its running time, and other
resources are not critical. In some cases a job might need some amount of
parallelism to satisfy other resource requirements. For example, it might be
a memory-intensive job, and if there is a limited amount of memory available
to each processor (which is usually the case), we have to schedule it on a
sufficient number of processors to satisfy its memory requirements. In such
a case virtualization is not available at all or only to a limited degree. Since
our  algorithms  use  virtualization  only  for  large  jobs,  our  results  may  still 
apply  or  be  modified  for  the  particular  application. 

We assume that virtualization preserves the work. In other words,
up to a certain number of processors a parallel job scales perfectly
and the running time decreases proportionally, and after that increasing the
number of processors does not improve the running time. Examples given
in [SOG+94] show that this approximation closely matches the behavior of
many tasks encountered in practice.
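
The work-preserving assumption can be written down as a small function. This is a sketch of the assumption as stated above; the name `virtualized_time` and its parameters are ours and do not come from the thesis.

```python
def virtualized_time(requested_procs, running_time, assigned_procs):
    """Running time of a job when executed on `assigned_procs` processors.

    requested_procs: number of processors the job asks for.
    running_time: its running time on that many processors.
    Assumes perfect scaling up to `requested_procs` and no gain beyond it.
    """
    if assigned_procs <= 0:
        raise ValueError("a job needs at least one processor")
    if assigned_procs >= requested_procs:
        # Extra processors do not improve the running time.
        return running_time
    # Fewer processors: the work (processors * time) is preserved.
    return running_time * requested_procs / assigned_procs
```

For instance, a job that requests 8 processors for 3 time units takes 6 time units on 4 processors; in both cases the work is 24 processor-time units.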

For approximation algorithms some research has been done under the
assumption  that  virtualization  does  not  preserve  the  work  [TWY92].  In 
the  off-line  setting,  an  approximation  algorithm  can  use  the  full  information 
about  the  running  time  of  a  job  on  any  number  of  processors.  In  particular, 
it  can  find  the  number  of  processors  such  that  the  work  of  the  job  is  minimal, 
which  is  essential  for  efficient  scheduling.  In  the  on-line  case,  we  assume  no 
knowledge  of  the  running  times.  If  we  do  not  even  know  how  the  work  of 
the  job  depends  on  the  number  of  processors,  then  no  efficient  scheduling  is 
possible. 

6.5  Speed  of  processors 

In  our  model  we  require  that  all  processors  run  with  the  same  speed.  Our 
motivation of massively parallel computation assumes one parallel machine
for processing parallel jobs with a possibly high amount of communication
between the processors, and such parallel machines are designed with processors
of  the  same  speed.  It  is  difficult  to  imagine  a  parallel  job  that  can  run  on  pro¬ 
cessors  of  possibly  different  speeds,  since  in  such  cases  it  is  usually  possible 
and  more  efficient  to  break  up  the  job  into  sequential  jobs  with  dependencies. 

In  the  case  of  sequential  jobs,  processors  of  different  speeds  are  motivated 
by  the  environment  in  which  many  sequential  machines  (possibly  different) 
are  connected  by  a  communication  network.  This  model  is  outside  the  scope 
of this work. It leads to interesting problems studied for example in [LST90,
SWW91, ST93].


6.6  Preemption 

In  our  model,  it  often  happens  that  many  processors  are  available  but  not 
in the requested configuration. If we are allowed to preempt a job, i.e., to
move  a  running  job  to  different  processors,  or  at  least  to  restart  a  job  on 
different  processors,  we  can  reorganize  the  running  jobs  periodically  so  that 
the  unused  processors  can  be  used  more  efficiently. 

But on massively parallel machines, context switching, which is needed
for preemptions and restarts, is very expensive. The main
reason is that moving the data and the current state of the memory requires
a high amount of communication, which can interfere with the execution of other
parallel jobs. If we allow preemptions, these costs can be incurred repeatedly,
and thus cannot be included in the running time of the job. If the
preemptions occur often, the additional costs are very high. The additional
costs  of  restarting  a  job  are  similar,  and  in  addition,  some  jobs  might  not 
allow  restarting  at  all.  For  these  reasons  we  consider  our  choice  of  the  model 
without  preemption  and  restarts  more  reasonable. 

For  scheduling  sequential  jobs,  preemptive  on-line  scheduling  is  studied 
for  example  in  [MPT93,  SWW91]. 

6.7  Fixed  release  times 

We  assume  that  a  job  can  be  executed  immediately  unless  it  is  dependent 
on  other  jobs.  However  in  practice  a  job  may  become  available  at  a  fixed 
time,  independent  of  other  jobs.  This  is  a  motivation  for  a  model  with  release 
times,  considered  for  example  in  [GJ79,  SWW91].  In  this  model  each  job  has 
a  fixed  release  time,  and  it  cannot  be  executed  earlier.  This  might  appear 
to  be  a  natural  and  powerful  extension  of  our  model  of  scheduling  without 
dependencies. 

We have two reasons showing that we do not need to incorporate the release
times explicitly. First, the effect of the fixed release times can be easily
simulated in the model with dependencies. Second, a very general observation
of [SWW91] shows that even in the model without dependencies, an
on-line  problem  with  release  times  is  not  much  more  difficult  than  the  corre¬ 
sponding  problem  without  release  times.  The  intuition  is  that  before  the  last 
job  is  released,  nothing  important  can  happen,  because  the  optimal  schedule 
has  to  contain  that  time  interval  as  well.  Therefore  if  we  start  scheduling 
only after the last job has arrived, we lose only a small constant factor. Of
course, we do not know which job is the last one, but this is a minor technical
problem which can be solved by scheduling the jobs in several batches
as follows. We first schedule all jobs available at the beginning, disregarding
the newly arrived jobs. After all of them are finished, we schedule all jobs
available at that point, and so on. Using this approach, the competitive ratio
is at most twice the competitive ratio of the algorithm used to
schedule the individual batches.
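
The batching scheme can be sketched as a wrapper around any scheduler for jobs without release times. In this illustration `schedule_in_batches` and the callback `schedule_batch` are hypothetical names; the callback is assumed to run one batch to completion and return the length of its schedule. Sorting by release time is used only to simulate arrivals and gives the wrapper no knowledge of future jobs.

```python
def schedule_in_batches(jobs, schedule_batch):
    """Schedule jobs with release times in successive batches.

    jobs: list of (release_time, job) pairs.
    schedule_batch: callable taking a list of jobs and returning the
        length of the schedule it produces for them (assumed interface).
    Returns (finish_time, batches), where batches lists (start_time, jobs).
    """
    pending = sorted(jobs, key=lambda item: item[0])
    now = 0.0
    i = 0
    batches = []
    while i < len(pending):
        # If nothing has been released yet, idle until the next release.
        if pending[i][0] > now:
            now = pending[i][0]
        # Collect every job released so far; later arrivals wait.
        batch = []
        while i < len(pending) and pending[i][0] <= now:
            batch.append(pending[i][1])
            i += 1
        batches.append((now, batch))
        # Run the whole batch to completion before looking at new jobs.
        now += schedule_batch(batch)
    return now, batches
```

In a toy setting where each job is represented by its running time and a batch is executed sequentially, `schedule_batch` can simply be the built-in `sum`.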

6.8  Performance  measures 

Intuitively, the competitive ratio can be interpreted as a value that measures
the  cost  of  the  information  about  the  future,  in  our  case  the  information 
about  the  running  times.  We  pay  for  not  having  the  information  about  the 
running  times  of  jobs  in  advance  by  decreasing  our  efficiency  by  a  factor 
which  is  at  most  the  competitive  ratio.  Thus,  given  an  algorithm  with  a 
small  competitive  ratio,  we  know  that  any  schedule  which  it  produces  is 
close  to  the  optimal  one. 

In  both  our  models,  with  and  without  dependencies,  the  length  of  the 
schedule  and  the  competitive  ratio  seem  to  be  good  and  realistic  performance 
measures. 

In  the  model  with  fixed  release  times  mentioned  above,  using  the  total 
length  of  the  schedule  as  a  performance  measure  is  somewhat  questionable. 
The  competitive  ratio  (with  respect  to  the  length  of  the  schedule)  is  still 
relevant to get a rough picture. As we will see later, to achieve a competitive
ratio of $\sigma$, the algorithm typically has to keep at least $1/\sigma$ of the
processors busy most of the time. Hence if $\sigma$ is the optimal competitive ratio,
and the total work of the jobs is a $1/\sigma$ fraction of the work that can be done
by the machine, it is possible to schedule all jobs; otherwise the backlog of
unscheduled jobs and the response time necessarily increase with time.

However, since the model of fixed release times is motivated mostly by an
interactive environment in which users can submit their jobs at any time, to
get more accurate and insightful results, it would be essential to consider
different measures of performance, such as the response time. Average response
time is considered for several variants of on-line scheduling of sequential jobs
in [MPT93]. The work of [TSWY94] considers approximate off-line scheduling
of parallel jobs with respect to average response time. On-line scheduling
of  parallel  jobs  with  average  response  time  as  the  performance  measure  is  an 
interesting  area  for  further  research  outside  the  scope  of  this  work. 

7  Notation  and  basic  techniques 

For a given job system $\mathcal{J}$ with or without dependencies, we denote the length
of an optimal schedule by $T_{\mathrm{opt}}(\mathcal{J})$. By $T(S)$ we denote the length of the
schedule $S$, i.e., the time when the last job finishes (its makespan). A
deterministic scheduling algorithm is $\sigma$-competitive if for every job system
$\mathcal{J}$ the schedule $S$ generated by the algorithm satisfies
$T(S) \le \sigma T_{\mathrm{opt}}(\mathcal{J})$. A randomized scheduling algorithm is
$\sigma$-competitive if for every job system $\mathcal{J}$ the expected length of the
schedule $S$ generated by the algorithm satisfies
$E[T(S)] \le \sigma T_{\mathrm{opt}}(\mathcal{J})$, where the expectation is taken over the
random bits of the scheduling algorithm.

For proving lower bounds on the competitive ratio it is useful to interpret
the  problem  as  a  game  between  the  scheduling  algorithm  and  the  adversary. 
In  the  deterministic  case  the  adversary  chooses  the  number  of  jobs,  their  job 
types  and  dependencies  in  advance.  Then  the  on-line  scheduling  algorithm 
starts  scheduling  them,  while  the  adversary  has  complete  control  over  the 
running  times  of  the  jobs:  he  can  stop  any  job  at  any  time.  To  prove  the  lower 
bound,  we  simulate  this  game,  and  then  use  the  job  system  with  running 
times  as  specified  by  the  adversary  during  the  game  as  an  input  for  the 
scheduling algorithm. The actions of the scheduling algorithm are the same,
as it is deterministic and on-line.

For randomized algorithms we can define different models depending on
the strength of the adversary, see [BDBK+90]. Our model corresponds to
an  oblivious  adversary,  which  means  that  the  adversary  has  no  access  to 
the  random  bits  of  the  scheduling  algorithm.  This  makes  the  adversary 
significantly  less  powerful  than  in  the  deterministic  case. 

To  analyze  the  competitive  ratio  of  our  algorithms,  we  need  to  bound  the 
length of the optimal schedule. We use the following two lower bounds for
the  optimal  schedule. 

First, the optimal schedule has to schedule the jobs that are dependent
on each other sequentially. For a given job system $\mathcal{J}$, $T_{\max}(\mathcal{J})$ denotes
the maximal sum of running times of jobs along any path in the dependency
graph; for a job system without dependencies this is just the maximal running
time of a job. Clearly $T_{\max}(\mathcal{J}) \le T_{\mathrm{opt}}(\mathcal{J})$ because an optimal
algorithm has to schedule these jobs sequentially; even if we use
virtualization, their running time cannot get shorter.

Second, the optimal schedule has to perform all the work of all jobs. For
a given job $J \in \mathcal{J}$, the work of $J$, denoted $\mathrm{work}(J)$, is the product of the
number of processors it requests and its running time. We define
$T_{\mathrm{eff}}(\mathcal{J})$ to be the time required to perform the work of all jobs on $N$
processors, where $N$ is the number of processors of the given machine:

$$T_{\mathrm{eff}}(\mathcal{J}) = \sum_{J \in \mathcal{J}} \mathrm{work}(J) / N.$$

Clearly $T_{\mathrm{eff}}(\mathcal{J}) \le T_{\mathrm{opt}}(\mathcal{J})$. Using virtualization does not
influence this fact, because we assume that virtualization preserves the
work, i.e., if a job is scheduled on a smaller number of processors (with a
given network topology, see Section 2.2), its running time is proportionally
larger.

We omit the argument $\mathcal{J}$ of $T_{\mathrm{opt}}(\mathcal{J})$, $T_{\max}(\mathcal{J})$ or
$T_{\mathrm{eff}}(\mathcal{J})$ if it is clear from the context which job system we are
referring to.
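
Both lower bounds are easy to compute for a concrete job system. The following Python sketch is only an illustration: the function name `lower_bounds` and the input encoding (a dictionary of jobs and a list of dependency pairs) are assumptions made for this example, and the dependency graph is assumed to be acyclic.

```python
from collections import defaultdict


def lower_bounds(jobs, deps, n_processors):
    """Return (T_max, T_eff) for a job system.

    jobs: dict mapping a job name to (processors, running_time).
    deps: list of (before, after) pairs; 'after' cannot start until
        'before' has finished. The graph is assumed to be acyclic.
    """
    preds = defaultdict(list)
    for before, after in deps:
        preds[after].append(before)

    longest = {}  # longest sum of running times on a path ending at a job

    def path_length(name):
        if name not in longest:
            _, t = jobs[name]
            longest[name] = t + max(
                (path_length(p) for p in preds[name]), default=0
            )
        return longest[name]

    t_max = max(path_length(name) for name in jobs)
    # Total work (processors times running time) spread over the machine.
    t_eff = sum(p * t for p, t in jobs.values()) / n_processors
    return t_max, t_eff
```

For example, three jobs `a = (2, 3)`, `b = (4, 1)` and `c = (1, 5)` on 4 processors, with `b` depending on `a`, have the longest path `c` of length 5 and a total work of 15, so the optimal schedule needs time at least 5.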

To analyze scheduling algorithms, the concept of efficiency is very
important. Lemma 7.1 shows that if a schedule has high efficiency except for
a  short  period  of  time,  then  the  schedule  cannot  be  much  longer  than  the 
optimal  one. 

The efficiency of a set $C$ of currently running jobs, denoted by $\mathrm{eff}(C)$, is
defined as the total number of processors assigned to a job in $C$ divided by
$N$. The efficiency of a schedule at time $t$ is the efficiency of the set of all jobs
running at time $t$, or equivalently the number of processors busy at time $t$
divided  by  N.  The  efficiency  with  respect  to  any  subgraph  of  the  machine 
is  defined  similarly.  Efficiency  of  an  algorithm  refers  to  the  efficiency  of  the 
schedule  generated  by  that  algorithm  (for  a  particular  job  system);  if  this 
algorithm  is  used  only  to  schedule  jobs  on  a  part  of  the  machine,  it  refers  to 
the  efficiency  with  respect  to  that  part  of  the  machine. 

For a schedule $S$ and an $\alpha \le 1$, $T_{<\alpha}(S)$ denotes the total time during
which the efficiency of $S$ is less than $\alpha$. Our symbols are summarized in
Table  7. 


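Symbol                            Explanation
--------------------------------  ---------------------------------------------
$T_{\mathrm{opt}}(\mathcal{J})$   The length of an optimal schedule for a job
                                  system $\mathcal{J}$
$T_{\max}(\mathcal{J})$           The maximal sum of running times along any
                                  path in the dependency graph of a job system
                                  $\mathcal{J}$
$T_{\mathrm{eff}}(\mathcal{J})$   Time required to perform the total work of
                                  all jobs in a job system $\mathcal{J}$
$T(S)$                            The length of a schedule $S$
$T_{<\alpha}(S)$                  The total time during a schedule $S$ when
                                  the efficiency is less than $\alpha$

Table 7: Table of symbols.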


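Lemma 7.1 Let $S$ be a schedule for a job system $\mathcal{J}$ (with or without
dependencies) such that the work of each job is preserved.

(i) Let $\alpha \le 1$, $\beta \ge 0$. Suppose that
$T_{<\alpha}(S) \le \beta T_{\mathrm{opt}}(\mathcal{J})$. Then

$$T(S) \le \left(\beta + \frac{1}{\alpha}\right) T_{\mathrm{opt}}(\mathcal{J}).$$

(ii) Let $0 \le \alpha_1 \le \alpha_2 \le 1$, $\beta \ge 0$. Suppose that the efficiency of
the schedule $S$ is at least $\alpha_1$ at all times and
$T_{<\alpha_2}(S) \le \beta T_{\mathrm{opt}}(\mathcal{J})$. Then

$$T(S) \le \left(\beta + \frac{1 - \alpha_1\beta}{\alpha_2}\right) T_{\mathrm{opt}}(\mathcal{J}).$$

Proof. (i) is a trivial consequence of (ii) for $\alpha_1 = 0$. (ii) The optimal
algorithm has to do at least the same amount of work as $S$, because in $S$ the
work of each job is preserved. Hence

$$T_{\mathrm{eff}} \ge \alpha_2 T(S) - (\alpha_2 - \alpha_1) T_{<\alpha_2}(S).$$

Therefore

$$T(S) \le \frac{1}{\alpha_2}\left(T_{\mathrm{eff}} + (\alpha_2 - \alpha_1) T_{<\alpha_2}(S)\right)
\le \frac{1}{\alpha_2}\left(1 + (\alpha_2 - \alpha_1)\beta\right) T_{\mathrm{opt}}
= \left(\beta + \frac{1 - \alpha_1\beta}{\alpha_2}\right) T_{\mathrm{opt}}.$$

□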
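
Since Lemma 7.1 is applied repeatedly with different constants, a small calculator can help check the arithmetic. The sketch below assumes that the bound of part (ii) reads $T(S) \le \frac{1}{\alpha_2}(1 + (\alpha_2 - \alpha_1)\beta)\,T_{\mathrm{opt}}$, which is how the proof above is read here; the function name `lemma_7_1_factor` is hypothetical.

```python
from fractions import Fraction


def lemma_7_1_factor(alpha1, alpha2, beta):
    """Factor c such that T(S) <= c * T_opt, assuming the bound of part (ii).

    alpha1: efficiency guaranteed at all times.
    alpha2: efficiency guaranteed except for a period of length beta * T_opt.
    Part (i) is the special case alpha1 = 0.
    """
    alpha1, alpha2, beta = Fraction(alpha1), Fraction(alpha2), Fraction(beta)
    if not (0 <= alpha1 <= alpha2 <= 1 and alpha2 > 0 and beta >= 0):
        raise ValueError("need 0 <= alpha1 <= alpha2 <= 1, alpha2 > 0, beta >= 0")
    return (1 + (alpha2 - alpha1) * beta) / alpha2
```

With efficiency at least 1/2 outside a period of length at most $T_{\mathrm{opt}}$, the factor is 3; with efficiency at least 2/3 it is 5/2; and with efficiency 1 outside such a period and at least $1/N$ always, it is $2 - 1/N$.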

Another lemma, which is useful for the analysis of scheduling of job systems
with  dependencies,  is  from  [Gra66].  We  use  it  to  bound  the  time  when  the 
efficiency  is  low.  Such  a  bound  makes  it  possible  to  apply  Lemma  7.1. 

Lemma 7.2 (Graham, 1966) Let S be a schedule for a job system J with
dependencies.  Then  there  exists  a  path  of  jobs  in  the  dependency  graph  such 
that  whenever  there  is  no  job  available  to  be  scheduled,  some  job  on  that  path 
is  running. 

Proof. Let $J_0$ be the job that finishes last. Let $t_1$ be the last time before
$J_0$ is started at which no job is available. Then there is a job $J_1$ running at
the time $t_1$ that is an ancestor of $J_0$ in the dependency graph, as otherwise
$J_0$ would already be available at $t_1$. By the same method construct $t_2$, $J_2$,
$t_3$, $J_3$, ..., $t_k$, $J_k$, until there is no time with no job available before
$J_k$ is started. Because of the way we selected the jobs,
$J_k, J_{k-1}, \ldots, J_0$ is a path in the dependency graph and one of the jobs
$J_k, J_{k-1}, \ldots, J_0$ is running at any time when no job is available. □

Now we present the example of Shmoys, Wein and Williamson [SWW91]
which proves a lower bound of $2 - \frac{1}{N}$ on the competitive ratio for scheduling
on any machine of $N$ processors. This lower bound uses only sequential
jobs, and hence it is valid for any network topology. Take a job system of
$N(N-1)+1$ sequential jobs. The adversary assigns running time 1 to all
jobs except to the job that is started last by the on-line algorithm; to the last
job he assigns time $N$. The length of the schedule generated by the on-line
algorithm is at least $2N - 1$. The optimal schedule takes time $N$, and hence
the competitive ratio is at least $2 - \frac{1}{N}$.
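
The adversary argument can be replayed numerically. The following sketch simulates one particular on-line algorithm, greedy list scheduling, against the adversary described above; the lower bound itself holds for any on-line algorithm, so this is only an illustration. The function name `adversary_makespan` is an assumption made for this example.

```python
import heapq


def adversary_makespan(n):
    """Makespan of greedy list scheduling on n processors against the adversary.

    There are n*(n-1)+1 sequential jobs. Every job runs for 1 time unit
    except the one started last, which runs for n time units.
    """
    total = n * (n - 1) + 1
    free_at = [0] * n  # times at which the processors become free
    heapq.heapify(free_at)
    makespan = 0
    for k in range(total):
        start = heapq.heappop(free_at)  # earliest available processor
        length = n if k == total - 1 else 1  # the adversary's choice
        finish = start + length
        makespan = max(makespan, finish)
        heapq.heappush(free_at, finish)
    return makespan
```

For `n = 4` the simulated schedule has length 7, while running the long job on one processor and the 12 unit jobs on the other three takes time 4, which gives the ratio $7/4 = 2 - 1/4$.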

For a particular architecture it is usually possible to specify the graph
requested by a job by a few parameters. We represent jobs on PRAMs as
$(p, t)$, where $p$ is the requested number of processors and $t$ is the running
time on $p$ processors. Jobs on hypercubes are represented as $(d, t)$, where $d$
is the dimension of the requested hypercube and $t$ is the running time. Jobs
on $d$-dimensional meshes are represented as $(a_1, \ldots, a_d, t)$, meaning that the
requested graph is a mesh of size $a_1 \times \cdots \times a_d$ and $t$ is the running time; we
always assume without loss of generality that $a_1 \ge \ldots \ge a_d$. Of course, the
on-line algorithms do not know the running times.

We write our algorithms in an easy to understand pseudocode. The
instruction “wait” means that the algorithm waits until all currently running
jobs are finished. We say that a processor is available, if it is currently not
assigned to any job. A job is available, if all its predecessors in the
dependency graph are finished, and it was not scheduled yet. Note that without
dependencies any unscheduled job is available. On the other hand, with
dependencies it might happen that no jobs are available, but at some later time
there  will  be  available  jobs,  namely  those  that  are  dependent  on  the  currently 
running  jobs.  Thus  we  can  be  sure  that  all  jobs  have  been  scheduled  only 
when  no  jobs  are  available  and  no  jobs  are  running. 

In some algorithms we do not specify exactly which job should be
scheduled next. In that case any available job (satisfying given constraints, if
there are any) can be scheduled, and our bounds are true for any such
implementation of the algorithm. Sometimes we require the machine to be
partitioned  into  several  subgraphs.  It  is  understood  that  these  subgraphs 
have  to  be  disjoint;  possibly  they  do  not  cover  the  whole  machine  (usually 
due  to  rounding). 

By “size” and “large” we refer to the number of processors requested by
a job, while “length” and “long” refer to its running time. This might be
especially confusing for one-dimensional mesh machines; we use “segment” for a
connected part of the machine or of the real line, while “interval” is reserved for
time intervals.

In reference to processor requirements of jobs, we use “require” if
virtualization is not allowed, otherwise we use “request” to indicate that a job can
be  scheduled  on  a  smaller  number  of  processors. 

The formula $a/bc$ always means the same as $a/(bc)$. All logarithms are in
base  2,  unless  specified  otherwise. 


Part  II 


Scheduling  parallel  jobs  with 
no  dependencies 

8  PRAMs 

In this section we present an optimal $(2 - \frac{1}{N})$-competitive algorithm for
on-line scheduling on PRAMs without dependencies. This is tight due to the
general lower bound presented in Section 7.

This result is essentially a generalization of Graham’s results on list
scheduling of sequential jobs [Gra66] to parallel jobs.

This algorithm uses the natural greedy approach. It schedules an
arbitrary job if sufficiently many processors are available. It does not matter
which job we choose, as long as we always schedule some job as soon as
possible. As we will see later, this is not true for more complex architectures.

Algorithm  GREEDY 

while there is an unscheduled job J do
    if some job J requires p processors and p processors are available,
    then schedule J on the p processors;
wait.

Theorem 8.1 The algorithm GREEDY is $(2 - \frac{1}{N})$-competitive for a PRAM
with  N  processors. 

Proof. Suppose that the algorithm generates a schedule of length $T$ for a
job system $\mathcal{J}$. Let $p$ be the minimal number of busy processors during the
entire schedule. Consider the last time $\tau$ when only $p$ processors were busy.
Let $J$ be some job running at that time. Before $J$ is scheduled, there could
not have been $p$ processors available, since at that point our algorithm would
schedule some job. After $J$ is finished, there also cannot be $p$ processors
available: at any time after $\tau$ there has to be some job $J'$ running that was
scheduled after $\tau$, as the efficiency is no longer minimal; and if there were
$p$ processors available, $J'$ would only require $N - p$ processors and it would
already have been scheduled before $\tau$, a contradiction.

The efficiency is at least $\alpha_1 = \frac{p}{N}$ during the entire schedule, and it is
less than $\alpha_2 = \frac{N-p+1}{N}$ only when $J$ is running. The time when $J$ is
running is bounded by $T_{\max} \le T_{\mathrm{opt}}$, hence by Lemma 7.1 we get

$$T \le \left(1 + \frac{N-p}{N-p+1}\right) T_{\mathrm{opt}}
= \left(2 - \frac{1}{N-p+1}\right) T_{\mathrm{opt}}
\le \left(2 - \frac{1}{N}\right) T_{\mathrm{opt}}.$$

□
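
A minimal simulation of GREEDY may make the algorithm concrete. This is a sketch under stated assumptions, not the author's implementation: the name `greedy_pram` is hypothetical, jobs are given as $(p, t)$ pairs with $p \le N$, and among the jobs that fit the simulation picks the first one in list order (any choice is allowed by the algorithm). The running times are used only to advance the simulated clock, never to choose a job.

```python
import heapq


def greedy_pram(jobs, n):
    """Return the makespan of GREEDY for jobs (p, t) on a PRAM with n processors.

    Assumes every job requires at most n processors.
    """
    pending = list(jobs)
    running = []  # heap of (finish_time, processors)
    now, free, makespan = 0, n, 0
    while pending:
        # Pick any pending job that fits; here the first one in list order.
        fit = next((job for job in pending if job[0] <= free), None)
        if fit is not None:
            pending.remove(fit)
            p, t = fit
            free -= p
            heapq.heappush(running, (now + t, p))
            makespan = max(makespan, now + t)
        else:
            # Nothing fits: advance the clock to the next job completion.
            now, p = heapq.heappop(running)
            free += p
    return makespan
```

On 4 processors, the jobs `[(2, 3), (3, 1), (2, 2), (1, 4)]` yield a schedule of length 6. Here $T_{\max} = 4$ and $T_{\mathrm{eff}} = 17/4$, so the result is well within the factor $2 - 1/4$ of the lower bound $17/4$.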

9  Hypercubes 

In this section we present an optimal $(2 - \frac{1}{N})$-competitive algorithm
HYPERCUBE for on-line scheduling on hypercubes. The algorithm is still greedy in
the sense that if there are sufficiently many processors available to schedule
a  job,  some  job  is  always  scheduled.  However,  in  contrast  to  the  algorithm 
for  PRAM,  we  now  require  that  the  largest  job  is  scheduled;  this  is  always 
possible  due  to  the  nice  structure  of  the  hypercube. 

A similar algorithm appears in [CL88]. Their variation of the algorithm
is not on-line; by using the information about running times for scheduling
they achieve a slightly better approximation ratio of 2 − … . However, for
on-line algorithms it follows from the general lower bound presented in Section 7
that it is impossible to achieve a better competitive ratio than $2 - \frac{1}{N}$, hence
our algorithm is optimal.

We suppose that the jobs $J_i = (d_i, t_i)$ are sorted by size,
$d_1 \ge d_2 \ge \ldots \ge d_m$, where $m$ is the number of jobs and $d_i$ the dimension
of a hypercube required by the job $J_i$. We say that a $d$-dimensional subcube is
normal if
the  coordinates  of  all  its  processors  are  identical  except  possibly  the  last 
d  coordinates.  This  implies  that  if  two  normal  subcubes  of  any  dimension 
intersect  then  one  of  them  is  a  subcube  of  the  other  one.  To  ensure  that  the 
space  is  used  efficiently,  the  jobs  are  only  scheduled  on  normal  subcubes. 

Algorithm  HYPERCUBE 

for $i := 1$ to $m$ do
    if there is a normal $d_i$-dimensional subcube available,
    then schedule the job $J_i$ on it;
wait.

Theorem 9.1 The algorithm HYPERCUBE is $(2 - \frac{1}{N})$-competitive for a
hypercube  of  N  processors. 

Proof.  From  the  properties  of  normal  subcubes  and  the  scheduling  algorithm 
it  follows  that  whenever  there  is  any  processor  available,  then  there  is  a  whole 
normal $d$-dimensional subcube available, where $d$ is the dimension of the job
scheduled  last.  Since  the  jobs  are  sorted  in  a  decreasing  order  according 
to  their  dimensions,  it  follows  that  any  available  job  can  be  scheduled,  in 
particular  the  largest  one.  This  proves  that  the  efficiency  is  1  as  long  as 
there is some unscheduled job left. The remaining time is bounded by $T_{\max}$
and the efficiency is at least $1/N$. The theorem follows by Lemma 7.1. □
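
The role of normal subcubes can be illustrated with a short simulation. The sketch assumes that processors are numbered by their coordinates read as a binary number, so that a normal $d$-dimensional subcube is an aligned block of $2^d$ consecutive indices; the function name `hypercube_schedule` is hypothetical, and each job is assumed to wait until a normal subcube of its dimension is free. Running times are used only to advance the simulated clock.

```python
import heapq


def hypercube_schedule(jobs, dim):
    """Simulate HYPERCUBE on a hypercube of dimension dim.

    jobs: list of (d, t) pairs with d <= dim.
    Returns (makespan, placements); each placement is
    (d, first_processor_index, start_time).
    """
    n = 1 << dim
    busy = [False] * n
    running = []  # heap of (finish_time, first_index, size)
    now, makespan = 0, 0
    placements = []
    # Largest dimension first; sorted() is stable for equal dimensions.
    for d, t in sorted(jobs, key=lambda job: -job[0]):
        size = 1 << d
        while True:
            # Normal subcubes are the blocks [k * size, (k + 1) * size).
            start = next(
                (s for s in range(0, n, size) if not any(busy[s:s + size])),
                None,
            )
            if start is not None:
                break
            # No normal subcube is free: advance to the next completion.
            now, s, sz = heapq.heappop(running)
            busy[s:s + sz] = [False] * sz
        busy[start:start + size] = [True] * size
        heapq.heappush(running, (now + t, start, size))
        placements.append((d, start, now))
        makespan = max(makespan, now + t)
    return makespan, placements
```

Because the jobs are processed in decreasing order of dimension and placed only on aligned blocks, the free processors always form whole aligned blocks of the size currently being placed, which is the property used in the proof.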


10  One-dimensional  meshes 

In  this  section  we  present  two  algorithms  for  scheduling  on  one-dimensional 
meshes  of  N  processors. 

Similarly  to  the  algorithm  HYPERCUBE,  we  schedule  the  large  jobs  first. 
However,  we  can  assure  that  the  efficiency  is  1  only  if  the  sizes  of  the  jobs 
and  of  the  machine  are  powers  of  2.  If  this  is  not  the  case,  the  first  algorithm, 
ORDERED, only achieves efficiency 1/2, and hence it is 3-competitive. The
second algorithm, CLUSTERS, is more complex and is 2.5-competitive. The
best lower bound we know is the general $2 - \frac{1}{N}$ one, which still leaves a gap
between the bounds.

A  different  algorithm  which  is  not  on-line  and  achieves  an  approximation 
ratio of 2.5 was obtained by Sleator [Sle80]. Recently this result was improved
by Steinberg [Ste93], who obtained an algorithm which is not on-line but
achieves  an  approximation  ratio  of  2. 

For both algorithms we suppose that the jobs $J_i = (a_i, t_i)$ are sorted by
their size, $a_1 \ge a_2 \ge \ldots \ge a_m$, where $m$ is the number of jobs and $a_i$ is the
number of processors required by the job $J_i$.

Algorithm ORDERED schedules the jobs in the order of their size, exactly
as algorithm HYPERCUBE does in the case of hypercubes. Note however
that if the sizes of jobs are not powers of two, it can happen that a job is not
scheduled even if there is a sufficiently long segment of available processors.
For example, if we have three jobs of sizes $2N/3$, $N/2$ and $N/3$, the algorithm
schedules the first job of size $2N/3$, and then waits until it finishes to schedule
the job of size $N/2$. The job of size $N/3$ is not scheduled even though there
are sufficiently many processors available, because we require the larger job
of size $N/2$ to be scheduled first. This means that the algorithm is less greedy
than the algorithms for PRAM and hypercubes.

If  the  machine  consists  of  several  disconnected  segments  and  any  job  fits 
into  any  of  these  segments,  the  first  algorithm  can  still  be  applied  and  the 
same bounds hold. This will be useful in Section 11.1, when we use it as a
subprogram  in  the  algorithms  for  two-dimensional  meshes. 

Algorithm  ORDERED 

for $i := 1$ to $m$ do
    if there is a segment of $a_i$ processors available,
    then schedule the job $J_i$ on the leftmost such segment;
wait.

Theorem 10.1 (i) If the sizes of all jobs and of the machine (or each
segment, if the machine consists of more segments) are powers of two, the
algorithm ORDERED is $(2 - \frac{1}{N})$-competitive and its efficiency is 1 as long as
there are unscheduled jobs.

(ii) For general jobs, the algorithm ORDERED is 3-competitive and its
efficiency is larger than 1/2 as long as there are unscheduled jobs.

Proof. If the sizes are powers of two, the proof is identical with the proof
for hypercubes. For general jobs, as long as there is a job available, the size
of any occupied segment is larger than the size of the largest waiting job
which is in turn larger than the size of any available segment. Therefore
the efficiency is larger than 1/2 as long as some job is available, and
the remaining time of the schedule is bounded by $T_{\max}$. Consequently by
Lemma 7.1 the algorithm is 3-competitive. □
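
The behaviour of ORDERED on a one-dimensional mesh, including the example with jobs of sizes $2N/3$, $N/2$ and $N/3$, can be reproduced with the following sketch. The names `ordered` and `leftmost_free_segment` are assumptions made for this example, the machine is a single segment of `n` processors, and every job is assumed to fit on the machine.

```python
import heapq


def leftmost_free_segment(busy, a):
    """Left end of the leftmost run of a free processors, or None."""
    run = 0
    for i, is_busy in enumerate(busy):
        run = 0 if is_busy else run + 1
        if run == a:
            return i - a + 1
    return None


def ordered(jobs, n):
    """Simulate ORDERED for jobs (a, t) on a one-dimensional mesh of size n.

    Returns a list of (a, left_end, start_time), one entry per job.
    Assumes a <= n for every job.
    """
    busy = [False] * n
    running = []  # heap of (finish_time, left_end, size)
    now = 0
    schedule = []
    # Jobs are started strictly in the order of decreasing size.
    for a, t in sorted(jobs, key=lambda job: -job[0]):
        while True:
            left = leftmost_free_segment(busy, a)
            if left is not None:
                break
            # The next job does not fit yet: wait for a running job to finish.
            now, l, size = heapq.heappop(running)
            busy[l:l + size] = [False] * size
        busy[left:left + a] = [True] * a
        heapq.heappush(running, (now + t, left, a))
        schedule.append((a, left, now))
    return schedule
```

For `n = 6` and jobs `[(4, 5), (3, 1), (2, 1)]`, the job of size 2 starts only at time 5, together with the job of size 3, although two processors were free from the beginning.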

The  second  algorithm  is  basically  a  refinement  of  the  previous  one.  If 
we  could  achieve  that  there  are  always  two  adjacent  jobs  between  any  two 
unused  intervals,  the  efficiency  would  be  2/3  instead  of  previous  1/2.  It  is 
impossible  to  maintain  this  arrangement  all  the  time,  but  by  placing  the  jobs 
carefully we can still achieve an efficiency of at least 2/3.

This algorithm is less greedy than the previous one, since it can
happen that the largest job is not scheduled even if there are sufficiently many
processors available.

The  algorithm  divides  the  mesh  into  a  number  of  segments,  starting  from 
one  segment  and  dividing  it  into  more  as  the  jobs  get  smaller.  We  call  those 
segments  clusters.  Each  cluster  contains  up  to  3  running  jobs:  a  left  job 
aligned  with  the  left  end  of  the  segment,  a  right  job  aligned  with  the  right 
end  of  the  segment  and  a  middle  job  somewhere  between  the  left  and  right 
ones.  (This  is  slightly  different  in  Phase  2  of  the  algorithm.) 

The  jobs  are  divided  into  the  set  J'  of  all  jobs  requiring  at  most  1/3 
of  the  processors  and  the  set  J"  of  all  jobs  requiring  more  than  1/3  of  the 
processors,  J"  =  J  —  J' .  When  we  refer  to  the  largest  job  in  J'  or  J",  we 
always  mean  the  largest  unscheduled  job. 

In some steps of the algorithm it is not evident that a job can be scheduled
as  required;  we  prove  the  correctness  of  the  algorithm  in  Theorem  10.2. 

A substantial part of the following algorithm and the proof is concerned
with large jobs, namely the step (1)(a) and the entire Phase 2 of the
algorithm. We feel that this is not the main issue, and recommend that the
reader focus on the other parts.

Algorithm  CLUSTERS 

Phase 1:
while there is an unscheduled job in J' do
    if there is a cluster I with efficiency less than 2/3, then
        (1) if there is no left job in I then
            (a) if I is the leftmost cluster and there is some job available in
                J", then schedule the largest job in J" as a left job in I,
            (b) else schedule the largest job in J' as a left job in I;
        (2) else if there is no right job in I, then schedule the largest job in
            J' as a right job in I;
        (3) else if there is no middle job in I, then schedule the largest job
            in J' as a middle job in I, positioned so that either its left end
            is in the left third of the cluster or its right end is in the right
            third of the cluster;
        (4) else schedule the largest job in J' adjacent to the middle job and
            divide I into two clusters with two jobs each.

Phase 2: Considering the whole mesh as a single cluster,
while there is an unscheduled job in J" do
    (5) if there is no left job, then schedule the largest job in J" as a left
        job,
    (6) else if there is no right job, all jobs from J' are finished and the
        efficiency is at most 1/2, then schedule the largest job in J"
        as a right job;

wait.

Theorem  10.2  The  algorithm  CLUSTERS  is  correct  and  2.5-competitive. 

Proof. First let us show that all steps of the algorithm are correct, namely
that it is always possible to schedule the jobs as required by the algorithm.
Steps (1)(a) and (5) ensure that there is at least one job from J" running as
long as J" is nonempty (notice that if a job from J" is finished, the efficiency
drops under 2/3 and the condition in the step (1)(a) or (5) is satisfied). The
jobs from J" are processed in decreasing order, starting with the largest job,
so that the next job will always fit.

Steps (1)(b), (2), (5) and (6) are correct because of similar reasoning,
since  the  jobs  from  both  J'  and  J"  are  processed  in  decreasing  order. 

If there are both left and right jobs running in I but no middle job, and
the efficiency is less than 2/3, then either the left or the right job has to be
smaller than 1/3 of the length of I, and also the largest unscheduled job in
J' is smaller. This justifies step (3).

If there are 3 jobs running in I, the middle one had to be scheduled in
step (3). Therefore we can assume that the middle job has its left end in the
left third of I (the other case is symmetric). If the efficiency is less than 2/3,
the left job must be smaller than the space between the middle and right jobs
(otherwise the space occupied by the three jobs would be at least as large as
the interval from the left end of the middle job to the right end of I, which
is at least 2/3 of I by the previous assumption). The largest unscheduled
job is not larger than the left job, therefore it can be scheduled between the
middle and right jobs and step (4) is justified.

Now  we  prove  that  the  competitive  ratio  is  at  most  2.5. 

The algorithm ensures that the efficiency is at least 2/3 as long as there
is some job available in J'. So if the job which finishes last is from J', we
get by Lemma 7.1 that the competitive ratio is at most 3/2 + 1 = 2.5.

If the last job is from J", it means that during the entire schedule at
least one job from J" is running and hence the efficiency is more than 1/3.

Before we proceed with the proof, note that if the off-line algorithm is not
allowed virtualization, it can run at most two jobs from J" at once. Hence
if the job that finishes last is from J", the length of the on-line schedule is
within a factor of 2 of any schedule that does not use virtualization. The rest
of the proof is needed only because we allow virtualization for the off-line
algorithm.

Let $T$ be the length of the entire schedule. We distinguish two cases
depending on the efficiency at the time when the last job from J' finishes.
If it is at most 1/2, then the only time when the efficiency can be less than
2/3 is at the beginning of Phase 2 after the last job from J' is scheduled and
at the end of Phase 2 when only one job is running. Therefore
$T_{<2/3} \le 2T_{\max}$ and by Lemma 7.1,
$T \le (2 + \frac{1}{2}) T_{\mathrm{opt}} = 2.5\,T_{\mathrm{opt}}$.

If the efficiency at the time when the last job from J' finishes is more
than 1/2, we know that the efficiency is at least 1/2 except for the time
when only the last job from J" is running: it is at least 2/3 during Phase
1, it is at least 1/2 until the last job from J' finishes, and then the
second job from J" is scheduled by the step (6) as soon as the efficiency
drops under 1/2, and the efficiency is again at least 2/3, as two jobs from
J" are scheduled simultaneously. Therefore $T_{<1/2} \le T_{\max}$ and by Lemma 7.1,

$$T \le \left(1 + \frac{4}{3}\right) T_{\mathrm{opt}} \le 2.5\,T_{\mathrm{opt}}.$$

□

11  The  two-dimensional  mesh 

11.1  Deterministic  algorithms 

Because  of  the  more  complex  geometric  structure  of  the  two-dimensional 
mesh,  the  greedy  approach  does  not  work  too  well.  A  better  strategy  is  to 
partition the jobs according to one dimension first, and then to schedule the
jobs  in  the  same  partition  using  one  of  the  algorithms  for  one-dimensional 
meshes  from  the  previous  section. 

We  first  give  some  definitions  and  simple  algorithms.  Then  we  gradually 
build  the  optimal  algorithm  using  the  simple  and  less  efficient  algorithms  as 
subprograms. 

Throughout this section we work with an $n_1 \times n_2$ mesh of $N = n_1 n_2$
processors. A job requiring an $a_i \times b_i$ mesh with running time $t_i$ is represented
as $(a_i, b_i, t_i)$ (of course, on-line algorithms do not know $t_i$). We assume that
$n_1 \ge n_2$ and that for each job $a_i \ge b_i$ without loss of generality.

We  first  describe  a  simple  modification  of  the  algorithm  ORDERED.  It 
treats  the  jobs  as  one-dimensional,  and  thus  it  is  efficient  only  if  the  heights 
of all the jobs are within a small range. Let $b$ denote the maximal height of
a job, $b = \max\{b_i \mid (a_i, b_i, t_i) \in \mathcal{J}\}$.


Algorithm  CLASS 

Partition the mesh into $\lfloor n_2/b \rfloor$ submeshes of size $n_1 \times b$;

Apply  ORDERED  to  schedule  J  on  these  submeshes,  disregarding  the 
second  dimension  of  the  jobs  and  the  submeshes,  viewing  them  as 
one-dimensional. 

Lemma 11.1 Suppose that the height of any job is more than $b/2$.

(i) If the dimensions of all jobs and of the machine are powers of two,
the algorithm CLASS is 2-competitive and its efficiency is 1 as long as there
are unscheduled jobs.

(ii) For general dimensions, the algorithm CLASS is 9-competitive and
its efficiency is larger than 1/8 as long as there are unscheduled jobs.

Proof.  If  the  dimensions  are  powers  of  two,  the  efficiency  does  not  decrease 
by  disregarding  the  second  dimension,  as  it  is  the  same  for  all  jobs.  For 
general jobs, the efficiency decreases by less than a factor of two, since we
assume  that  the  second  dimension  of  every  job  is  more  than  b/2,  and  by  an 
additional  factor  smaller  than  two  because  of  rounding  when  we  divide  the 
mesh.  The  rest  follows  from  Theorem  10.1.  □ 

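Let $J$ be a set of two-dimensional jobs. Define a partition of $J$ into job
classes $J = J^{(0)} \cup \dots \cup J^{(\lfloor \log n_2 \rfloor)}$ by
$J^{(l)} = \{(a_i, b_i, t_i) \in J \mid n_2/2^{l+1} < b_i \le n_2/2^l\}$.
Define the order of $J$ to be the number of nonempty job classes,
$\mathrm{order}(J) = |\{l \mid J^{(l)} \ne \emptyset\}|$.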
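
The class index of a job and the order of a job system can be computed directly from this definition. The following Python sketch is only an illustration: the function names `job_class` and `order` are ours, jobs are assumed to be given as triples (a, b, t) with integer dimensions, and the class $l$ of a job of height $b$ is taken to be the one with $n_2/2^{l+1} < b \le n_2/2^l$.

```python
def job_class(b, n2):
    """Return l such that n2 / 2**(l+1) < b <= n2 / 2**l (integer arithmetic)."""
    if not 1 <= b <= n2:
        raise ValueError("job height must satisfy 1 <= b <= n2")
    l = 0
    # Increase l while b still fits under the next, smaller bound n2 / 2**(l+1).
    while b * 2 ** (l + 1) <= n2:
        l += 1
    return l


def order(jobs, n2):
    """Number of nonempty job classes among jobs given as (a, b, t) triples."""
    return len({job_class(b, n2) for (_a, b, _t) in jobs})
```

For example, on a mesh with $n_2 = 16$ the heights 16, 8, 5 and 1 fall into the classes 0, 1, 1 and 4, so a job system with exactly these heights has order 3.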

The  algorithm  CLASS  is  efficient  if  the  job  system  has  only  one  job  class, 
but  not  for  general  job  systems.  The  simplest  way  to  use  it  for  general  job 
systems  is  to  schedule  the  classes  one  by  one.  This  leads  to  the  following 
$O(\log N)$-competitive algorithm.

Algorithm SERIAL

for $l := 0$ to $\log n_2$ do
    apply CLASS to schedule the jobs from $J^{(l)}$.

Lemma 11.2 (i) If the dimensions of all jobs and of the mesh are powers
of two, the algorithm SERIAL is $(\mathrm{order}(J) + 1)$-competitive.

(ii) For general jobs, the algorithm SERIAL is $(\mathrm{order}(J) + 8)$-competitive.

Proof. By Lemma 11.1 the time when the efficiency is low (less than 1 if
dimensions are powers of two, less than 1/8 for general dimensions) is at
most $T_{\max}$ for every nonempty class, for a total of $\mathrm{order}(J)\,T_{\max}$. Lemma 7.1
finishes the proof. □

To achieve a lower competitive ratio, we schedule multiple job classes
in parallel in different submeshes. However, then the large jobs may not
fit into our submeshes. We solve this problem by handling the large jobs
separately. We use the previous algorithm to schedule a small number of
classes which contain the large jobs. This approach leads to the following
$O(\log\log N)$-competitive algorithm.

Algorithm PARALLEL

(1) Use SERIAL to schedule the classes $J^{(l)}$ with $l \le \log(\mathrm{order}(J))$;

let $i := 1$;

repeat

    (2) let $J_i$ be all jobs that have not been scheduled yet;
        let $h_i := \mathrm{order}(J_i)$;
        partition the mesh into $h_i$ submeshes of size $n_1 \times \lfloor n_2/h_i \rfloor$;
        to each nonempty job class $J_i^{(l)}$ assign one of these submeshes and
        denote it by $G_{i,l}$;

    (3) apply CLASS in parallel to each nonempty class $J_i^{(l)}$ on $G_{i,l}$ until
        the first time when the total efficiency of the running jobs (with
        respect to the whole $n_1 \times n_2$ mesh) is less than 1/16;

    (4) wait until all running jobs have finished;

    let $i := i + 1$;

until all jobs are scheduled.

Theorem 11.3 Let $S$ be a schedule generated by PARALLEL for a job system
$J$. Then $T(S) \le O(\log(\mathrm{order}(J)))\,T_{\mathrm{opt}}(J)$. In particular, the algorithm
PARALLEL is $O(\log\log N)$-competitive for scheduling on a two-dimensional
mesh of $N$ processors.

Proof. For every $i$ and $l$ the submesh $G_{i,l}$ is large enough to fit any job
from $J_i^{(l)}$, since $l > \log(\mathrm{order}(J))$ if $J^{(l)}$ is nonempty after the step (1), and
therefore the smaller dimension of any job in $J_i^{(l)}$ is at most
$n_2/\mathrm{order}(J) \le n_2/h_i$.

By Lemma 11.2, step (1) takes time at most $(\log(\mathrm{order}(J)) + 8)\,T_{\mathrm{opt}}$.

To bound the time spent in the loop of steps (2) to (4), we first prove
that $h_{i+1} < h_i/2$ for all $i$ during the execution of PARALLEL. If there are
unscheduled jobs in $J_i^{(l)}$ at the end of step (3), then by Lemma 11.1 the
efficiency of the schedule for the jobs in $J_i^{(l)}$ (with respect to $G_{i,l}$) is greater
than 1/8. Therefore, the total efficiency at the end of step (3) is at least
$h_{i+1}/(8h_i)$. On the other hand, by the condition in (3), the efficiency is less
than 1/16. Hence $h_{i+1} < h_i/2$.

It follows that the number of passes through steps (2) to (4) is at
most $\log(\mathrm{order}(J)) + 1$. Since the time spent in each pass through step
(4) is bounded by $T_{\max}$ and the efficiency during the step (3) is at least
1/16, the total time spent in steps (2) to (4) is by Lemma 7.1 bounded by
$(\log(\mathrm{order}(J)) + 17)\,T_{\mathrm{opt}}$.

Therefore the length of the schedule is at most
$O(\log(\mathrm{order}(J)))\,T_{\mathrm{opt}} = O(\log\log N)\,T_{\mathrm{opt}}$, which finishes the proof. □
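
The bound on the number of passes can be checked numerically. The sketch below (the function name is ours and not part of the algorithm) lets the number of nonempty classes shrink as slowly as the inequality $h_{i+1} < h_i/2$ permits and counts the passes until no class is left.

```python
def worst_case_passes(order):
    """Count passes of steps (2)-(4) when h shrinks as slowly as h_{i+1} < h_i / 2 allows."""
    passes = 0
    h = order
    while h > 0:
        passes += 1
        h = (h - 1) // 2  # largest integer strictly smaller than h / 2
    return passes
```

For every starting order the count stays within $\lfloor \log_2(\mathrm{order}) \rfloor + 1$, in agreement with the bound used in the proof.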

Now we construct an $O(\sqrt{\log\log N})$-competitive on-line scheduling algorithm
for the two-dimensional mesh. We use the previous algorithm PARALLEL
to schedule the large jobs.

The improvement of this algorithm over PARALLEL is in making an optimal
tradeoff between the average efficiency of the schedule and the amount
of time that the efficiency of the schedule is below average. To achieve this
tradeoff, we schedule new jobs in a small part of the mesh, instead of using
the whole machine as in PARALLEL. This ensures that when the efficiency
is too low, we can use the next part immediately, instead of waiting for all
running jobs to finish.

In  Section  11.3  we  prove  a  matching  lower  bound.  The  proof  of  the  lower 
bound  shows  that  this  non-intuitive  strategy  of  using  only  a  part  of  the  mesh 
is  in  fact  necessary  to  achieve  the  optimal  competitive  ratio.  If  we  try  to 
use the whole mesh, we cannot get a much better competitive ratio than
the $O(\log\log N)$ of the algorithm PARALLEL.

Let $k = \lfloor \sqrt{\log(\mathrm{order}(J))} \rfloor = O(\sqrt{\log\log N})$. We partition the mesh into
$k$ submeshes of size $n_1 \times \lfloor n_2/k \rfloor$, denoted by $G_j$, $1 \le j \le k$ (see Figure 3).

Figure 3: The partition of the mesh used in the algorithm BALANCED PARALLEL.
(The thick lines partition the mesh into submeshes $G_j$, the thin lines
partition them further into submeshes $G_{i,j,l}$.)


Algorithm BALANCED PARALLEL

(1) Use PARALLEL to schedule the classes $J^{(l)}$ with $l \le 2\log\log N$;

let $i := 1$;

repeat

    for $j := 1$ to $k$ do begin

        (2) let $J_{i,j}$ be all jobs that have not been scheduled yet;
            let $h_{i,j} := \mathrm{order}(J_{i,j})$;
            partition the mesh $G_j$ into $h_{i,j}$ submeshes of size $n_1 \times \lfloor n_2/(k h_{i,j}) \rfloor$;
            to each nonempty job class $J_{i,j}^{(l)}$ assign one of these submeshes
            and denote it by $G_{i,j,l}$ (see Figure 3);

        (3) apply CLASS in parallel to each nonempty class $J_{i,j}^{(l)}$ on $G_{i,j,l}$
            until the first time when the efficiency of all currently running
            jobs with respect to $G_j$ is less than 1/16; then interrupt
            all instances of CLASS (but allow the current jobs to be processed);

    end;

    (4) wait until all running jobs have finished;

    let $i := i + 1$;

until all jobs are scheduled.

Theorem 11.4 The algorithm BALANCED PARALLEL is $O(\sqrt{\log\log N})$-competitive
for scheduling on a two-dimensional mesh of $N$ processors.

Proof. For every $i$, $j$ and $l$ the submesh $G_{i,j,l}$ is large enough to fit any job
from $J_{i,j}^{(l)}$, since $l > 2\log\log N$ if $J^{(l)}$ is nonempty after the step (1), and
therefore the smaller dimension of any job in $J_{i,j}^{(l)}$ is at most
$n_2/(\log N)^2 \le n_2/(k h_{i,j})$.

By Theorem 11.3, step (1) takes time at most
$O(\log(\log\log N))\,T_{\mathrm{opt}} \le O(\sqrt{\log\log N})\,T_{\mathrm{opt}}$.

As in the proof of Theorem 11.3, $h_{i,j}$ decreases by a factor of 2 after each
pass through the steps (2) to (3). That means that during each pass through
the repeat-until loop it decreases by at least a factor of $2^k$ and therefore there
can be at most $\lceil \log(\mathrm{order}(J))/k \rceil = O(\sqrt{\log\log N})$ passes through the step
(4). Since the time spent in each pass through step (4) is bounded by $T_{\max}$
and the efficiency with respect to the whole mesh during the step (3) is at
least $1/16k = \Omega(1/\sqrt{\log\log N})$, the total time spent in steps (2) to (4) is by
Lemma 7.1 bounded by $O(\sqrt{\log\log N})\,T_{\mathrm{opt}}$.

Note that we have actually proved that the competitive ratio is at most
$O(\sqrt{\log\log n_2})$, which is smaller than $O(\sqrt{\log\log N})$ if $n_2$ is much smaller
than $n_1$.
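
The tradeoff can be made concrete with a small computation. The helper below is hypothetical and assumes logarithms to base 2; it returns $k = \lfloor \sqrt{\log(\mathrm{order}(J))} \rfloor$ together with the bound $\lceil \log(\mathrm{order}(J))/k \rceil$ on the number of passes through step (4). The guard `max(1, ...)` for very small orders is our addition.

```python
import math


def balanced_parameters(order):
    """Return (k, bound on passes through step (4)) for a job system of given order."""
    log_order = math.log2(order)
    # k = floor(sqrt(log(order))); clamp to 1 so that the mesh is never split into 0 parts.
    k = max(1, math.floor(math.sqrt(log_order)))
    waits = math.ceil(log_order / k)
    return k, waits
```

With $\mathrm{order}(J) = 2^{16}$ this gives $k = 4$ and at most 4 waits, compared with up to 17 passes in PARALLEL, at the price of an efficiency guarantee of $1/16k$ instead of $1/16$ while jobs are being started.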

11.2  Off-line  scheduling 

In this section we prove that for any job system $J$, $T_{\mathrm{opt}}(J)$ is within a
constant factor of $\max(T_{\mathrm{eff}}(J), T_{\max}(J))$. Intuitively, this means that if there
are no long jobs, we can schedule all the jobs so that the average efficiency
is at least some constant.

We use this result twice. In Section 11.3 we prove that no on-line algorithm
can guarantee that the length of a schedule $S$ it generates is bounded
by $T(S) = o(\sqrt{\log\log N}) \max(T_{\mathrm{eff}}, T_{\max})$. To derive a lower bound on the
competitive ratio, we will use the result of this section. In Section 11.4 we
use the same ideas as a part of our on-line randomized algorithm.

To make our exposition simpler, we first assume that the dimensions of all
the jobs and the machine are powers of two and the machine is a square $n \times n$
mesh. Let $m_l = n/2^l$. In contrast to the on-line deterministic algorithm, we
now divide the jobs into classes according to their larger dimension,
$J^{(l)} = \{(a_i, b_i, t_i) \in J \mid m_{l+1} < a_i \le m_l\}$.

The idea of our algorithm is to assign to each class an area proportional
to the work in that class, and then schedule each class using the on-line algorithm
ORDERED. We have to ensure that every job fits in the area assigned
to it. To achieve this, we schedule the class $J^{(0)}$ separately, and require the
area assigned to class $J^{(l)}$ to be a union of several square submeshes of size
$n/2^l$. This rounding requires some additional area, which can be bounded
by a third of the size of the machine.

The schedule is generated by the following off-line algorithm. $W_l$ denotes
the work of $J^{(l)}$, $W$ denotes the total work of all classes except $J^{(0)}$, and $z_l$
denotes the number of $m_l \times m_l$ meshes assigned to $J^{(l)}$.

Algorithm OFFLINE

(1) schedule $J^{(0)}$ using the algorithm ORDERED;

(2) for each $l > 0$, let $W_l := \sum_{(a_i, b_i, t_i) \in J^{(l)}} a_i b_i t_i$;
    let $W := \sum_{l > 0} W_l$;
    for each $l > 0$, let $z_l := \lceil 2n^2 W_l / (3 m_l^2 W) \rceil$;

(3) choose disjoint square submeshes $H_{l,j}$, $l > 0$, $j = 1, \dots, z_l$, of size
    $m_l \times m_l$ (see below for details);

(4) in parallel for each $l$ schedule the class $J^{(l)}$ on the collection of grids
    $\{H_{l,1}, \dots, H_{l,z_l}\}$ using ORDERED.

We need to justify the step (3). Since the meshes are squares whose sizes
are  powers  of  two,  we  can  place  them  greedily  starting  with  the  largest  mesh, 
so  that  the  coordinates  of  each  mesh  are  divisible  by  its  size.  See  Figure  4 
for  an  example. 
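
The greedy placement can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the exact procedure: the function names are ours, the work values are integers, and the counts are taken to be $z_l = \lceil 2n^2 W_l / (3 m_l^2 W) \rceil$, a reading of step (2) that agrees with the proof of Theorem 11.5. Squares are handed out largest first by repeatedly splitting a free aligned square into quadrants, which keeps every corner divisible by the size of its square.

```python
def mesh_counts(n, work):
    """Return {m_l: z_l} for the classes l > 0 with positive work.

    work maps l to W_l; z_l = ceil(2 n^2 W_l / (3 m_l^2 W)) is an assumed
    reading of step (2), with W the total work of the classes l > 0.
    """
    total = sum(w for l, w in work.items() if l > 0)
    counts = {}
    for l, w in work.items():
        if l <= 0 or w <= 0:
            continue
        m = n >> l  # m_l = n / 2^l
        counts[m] = -((-2 * n * n * w) // (3 * m * m * total))  # integer ceiling
    return counts


def place_squares(n, counts):
    """Place counts[size] aligned size-by-size squares in the n-by-n mesh.

    Returns a list of (x, y, size); raises ValueError if the area does not suffice.
    """
    free = [(0, 0, n)]  # free aligned squares, each at least as large as the current size
    placed = []
    for size in sorted(counts, reverse=True):  # largest squares first
        for _ in range(counts[size]):
            if not free:
                raise ValueError("not enough area for the requested squares")
            x, y, s = free.pop()
            while s > size:  # split into quadrants, keep one, free the other three
                s //= 2
                free.extend([(x + s, y, s), (x, y + s, s), (x + s, y + s, s)])
            placed.append((x, y, size))
    return placed
```

Because sizes are processed in decreasing order, every free square is at least as large as the one requested, so the placement fails only when the total requested area exceeds $n^2$.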

Figure 4: An example of the partition of the mesh used in the algorithm
OFFLINE.


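This process will place all the meshes as long as the total number of processors
in them is not too large, which is true in our case since
$\sum_{l>0} z_l m_l^2 \le \sum_{l>0} \left( \frac{2n^2 W_l}{3W} + m_l^2 \right) \le \frac{2}{3}n^2 + \frac{1}{3}n^2 = n^2$.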

For this algorithm we cannot prove that the time when the efficiency is
low is short. Instead, we prove a similar claim about the average efficiency.
More precisely, we prove that the length of the schedule is bounded by a
constant multiple of the work divided by the number of processors, plus an
additive term bounded by a constant multiple of the longest running time.


Theorem 11.5 (i) If the dimensions of the jobs and of the machine are
powers of two and the machine is a square mesh, the algorithm OFFLINE
produces a schedule $S$ whose length is bounded by $T(S) \le 3.5 \max(T_{\mathrm{eff}}, T_{\max})$.

(ii) For general dimensions, there exists an off-line algorithm which produces
a schedule $S$ whose length is bounded by $T(S) = O(\max(T_{\mathrm{eff}}, T_{\max}))$.


Proof. (i) We prove for each $l > 0$ that the instance of ORDERED scheduling
$J^{(l)}$ in step (4) finishes after time at most $\frac{3W}{2n^2} + T_{\max}$. Suppose
for a contradiction that for some $l$ this is not the case. Then for time more
than $\frac{3W}{2n^2}$ there are jobs available, and hence by Lemma 11.1 the efficiency
relative to the area assigned to $J^{(l)}$ is 1. Thus the work done by jobs
in $J^{(l)}$ is more than $\frac{3W}{2n^2}\, z_l m_l^2 \ge W_l$, a contradiction. Therefore the
step (4) takes time at most $\frac{3W}{2n^2} + T_{\max}$.

During the step (1), the efficiency is 1 except for time at most $T_{\max}$. Let $W'$ be
the total work including $J^{(0)}$. Then the length of the schedule is bounded
by $\frac{3W'}{2n^2} + 2T_{\max} = \frac{3}{2}T_{\mathrm{eff}} + 2T_{\max}$.

(ii)  For  general  dimensions,  we  first  schedule  large  jobs.  Then  we  round 
the  dimensions  of  jobs  to  the  next  higher  power  of  two  and  the  size  of  the 
mesh  to  the  next  smaller  power  of  two,  and  proceed  as  before.  This  changes 
the  efficiency  by  a  constant  factor  only. 

If the mesh is not a square, we partition the jobs into classes according to
their  smaller  dimension.  Instead  of  square  meshes  we  then  assign  the  area  to 
each  class  in  submeshes  whose  longer  dimension  is  the  longer  dimension  of 
the  machine.  This  is  less  efficient,  since  we  use  larger  area  for  rounding  the 
area  assigned  to  each  class  to  these  submeshes,  but  the  difference  is  again 
only  in  the  constant  factor.  □ 

11.3  A  lower  bound  on  deterministic  scheduling 

We prove that no on-line scheduling algorithm on an $n \times n$ mesh of $N$ processors
can achieve a competitive ratio better than $\epsilon\sqrt{\log\log N}$ for some $\epsilon > 0$.
This proves that the algorithm in Section 11.1 is optimal up to a constant
factor. (Note that it is sufficient to prove the lower bound for square meshes.)

To prove this lower bound we use an adversary as introduced in Section 7.
We specify the number of jobs and their job types, and then design
an adversary who assigns the running times depending on the action of the
scheduler.

The adversary tries to restrict the possibilities of the scheduler so that he
has to act similarly to the optimal algorithm we presented in Section 11.1. The
key technical point is Lemma 11.6. It shows that the adversary can restrict
the actions of the scheduler substantially. More precisely, if the efficiency is
high, the adversary is able to find a small subset of the running jobs which
effectively block a large portion of the mesh. This implies that new jobs have
to take new space, and eventually there is no more space available and the
scheduler must wait until the running jobs are stopped.

Of  course,  the  adversary  cannot  go  on  forever.  The  price  he  pays  is  that 
after  each  such  step  the  number  of  distinct  sizes  of  available  jobs  is  reduced. 
Nevertheless,  it  is  possible  to  repeat  this  process  sufficiently  many  times 
before  all  available  job  sizes  are  eliminated. 

11.3.1  Notation 

For the proof of the lower bound it is convenient to represent meshes as
rectangles with both coordinates running through the real interval $[0, n]$. A
processor then corresponds to a unit square and an $x \times y$-submesh at $(X, Y)$
corresponds to the $x \times y$ rectangle with the lower left corner at $(X, Y)$.

During the proofs we will also use rectangles with non-integer dimensions
and coordinates. We say that a rectangle $R'$ intersects a set of rectangles $\mathcal{R}$
if the area of $R' \cap R$ is not zero for some $R \in \mathcal{R}$.

A normal $x \times y$-rectangle is a rectangle with width $x$ and height $y$ with
the lower left corner at $(X, Y)$ such that $X$ is an integer multiple of $x$ and
$Y$ is an integer multiple of $y$. A normal $(x, y)$-rectangle is a normal $x \times y$- or
$y \times x$-rectangle.

Observe that any two distinct normal $x \times y$-rectangles are disjoint and that
any rectangle larger than $x \times y$ (in particular the $n \times n$ mesh in our case) can
be partitioned into a set of non-intersecting normal $x \times y$-rectangles and a
small leftover.
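
As a small illustration of this notion, the following Python sketch tests whether a rectangle is normal and lists the normal $x \times y$-rectangles contained in the $n \times n$ mesh. The function names are ours; exact rational arithmetic (`fractions.Fraction`) is used because the dimensions need not be integers.

```python
from fractions import Fraction


def is_normal(X, Y, x, y):
    """True if the x-by-y rectangle with lower left corner (X, Y) is normal."""
    return Fraction(X) % Fraction(x) == 0 and Fraction(Y) % Fraction(y) == 0


def normal_partition(n, x, y):
    """Lower left corners of all normal x-by-y rectangles inside the n-by-n mesh."""
    x, y = Fraction(x), Fraction(y)
    cols = int(Fraction(n) // x)  # number of full columns of width x
    rows = int(Fraction(n) // y)  # number of full rows of height y
    return [(i * x, j * y) for i in range(cols) for j in range(rows)]
```

For $n = 10$, $x = 3$ and $y = 4$ the mesh contains six normal rectangles covering 72 of the 100 unit squares; the remaining 28 form the leftover.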

11.3.2  The  scheduling  problem 

Now we are ready to specify the job system used for the lower bound proof.
Let $k = \lfloor \frac{1}{3}\sqrt{\log\log N} \rfloor$, $s = \lfloor (\log\log N)^2 \rfloor$, $t = \lfloor \frac{1}{2}\log_s n \rfloor$.

We have $t + 1$ different job classes, $J = J_0 \cup \dots \cup J_t$. The job class
$J_j$ contains $nk^2$ jobs of size $\frac{n}{s^j} \times s^j$ (the running times of the jobs will be
determined dynamically by the adversary depending on the actions of the
on-line scheduler). Note that for $i < j \le t$, $\frac{n}{s^i} > \frac{n}{s^j} \ge s^j > s^i$.

Suppose we use an on-line scheduling algorithm to schedule this job system.
For $I \subseteq [0..t]$, let $\mathcal{C}(I)$ denote the set of all submeshes corresponding
to currently running jobs from $J_j$, $j \in I$. Let $\mathcal{C} = \mathcal{C}([0..t])$ denote the set of
submeshes corresponding to all currently running jobs.

11.3.3  Adversary  strategy 

The  adversary  strategy  is  based  on  the  following  lemma  which  is  proved  in 
Section  11.3.4. 

Lemma 11.6 Suppose that we are scheduling the job system from Section 11.3.2.
Then at any time of the schedule and for any interval $I' = [a..b]$,
$0 \le a \le b \le t$, and $\bar{I'} = [0..t] - I'$, there exists a set $\mathcal{D} = \mathcal{D}(I') \subseteq \mathcal{C}(\bar{I'})$ such
that $\mathrm{eff}(\mathcal{D}) \le 1/8k$ and every normal $(\frac{n}{4ks^b}, \frac{s^a}{4k})$-rectangle intersected by $\mathcal{C}(\bar{I'})$
is also intersected by $\mathcal{D}$.

The adversary maintains an active interval denoted by $I$. Initially $I = [0..t]$
and with time $I$ gradually gets smaller. Let $T$ be some fixed time. The
adversary reacts to the scheduler's actions according to the following steps.

SINGLE JOB: If the scheduler schedules some job on a submesh with
an area smaller than $n/k$, then the adversary removes all other jobs (both
running and waiting ones) and runs this single job for a sufficient amount of
time.

DUMMY: If the scheduler starts a job that does not belong to a job class
in the active interval (a job from $J_j$, $j \notin I$), then the adversary removes it
immediately.

CLEAN UP: If the time since the last CLEAN UP step (or since the
beginning of the schedule) is equal to $T$ and there was no SINGLE JOB
step, then the adversary removes all running jobs, i.e., assigns their running
times so that they are completed at this point.

DECREASE EFFICIENCY: If $\mathrm{eff}(\mathcal{C})$ exceeds $1/k$, the adversary does
the following: He takes an interval $I' \subseteq I$ such that $|I'| = \lfloor |I|/2 \rfloor$ and
$\mathrm{eff}(\mathcal{C}(I')) \le \mathrm{eff}(\mathcal{C}(I))/2$ (such $I'$ obviously exists: either the upper or the
lower half of $I$, whichever has lower efficiency). Then he computes $\mathcal{D}(I')$
according to Lemma 11.6 and removes all jobs except those from $\mathcal{C}(I') \cup \mathcal{D}(I')$.
He then sets the active interval to $I'$.
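
The choice of $I'$ in the DECREASE EFFICIENCY step is a simple comparison of two halves. In the sketch below, `eff[j]` is a hypothetical list holding the efficiency of the currently running jobs of class $J_j$ (scaled to integers in the example), and the active interval is given by its endpoints; both the name `choose_half` and this representation are assumptions made for illustration.

```python
def choose_half(eff, lo, hi):
    """Return the half (a, b) of the active interval [lo..hi] with lower efficiency.

    Both candidate halves contain floor(|I| / 2) classes and are disjoint, so the
    cheaper one carries at most half of the efficiency of the whole interval.
    """
    half = (hi - lo + 1) // 2
    lower = (lo, lo + half - 1)
    upper = (hi - half + 1, hi)

    def total(interval):
        a, b = interval
        return sum(eff[j] for j in range(a, b + 1))

    return lower if total(lower) <= total(upper) else upper
```

When $|I|$ is odd the middle class belongs to neither half, which only helps the adversary.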

11.3.4  Evaluation  of  the  Adversary  Strategy 

In this section we prove that the adversary strategy from the previous section
ensures that $T(S) \ge k \cdot \max(T_{\mathrm{eff}}(J), T_{\max}(J))$. By Theorem 11.5 this is
sufficient to prove the lower bound on the competitive ratio.

If  the  scheduler  starts  a  job  that  does  not  belong  to  a  job  class  in  the 
active  interval  then  immediately  removing  it  by  a  DUMMY  step  essentially 
does  not  change  the  schedule.  If  the  scheduler  allows  a  SINGLE  JOB  step, 
the  scheduler  used  a  simulation  factor  greater  than  k  to  schedule  this  job. 
Therefore  in  an  optimal  schedule  the  running  time  of  this  job  will  be  more 
than  k  times  shorter.  The  adversary  assigns  a  sufficiently  large  time  to  this 
job, and thus guarantees that the scheduler is not $k$-competitive.

So we assume that the scheduler always starts jobs from the active interval
and the adversary only performs DECREASE EFFICIENCY and
CLEAN UP steps. The CLEAN UP steps divide the schedule into phases.

The lower bound proof follows this outline. We first prove Lemma 11.6
which justifies the DECREASE EFFICIENCY step and then we show that
each phase can have at most $6k$ of these steps. This implies that every
schedule has to have at least $k$ phases, thus proving that
$T(S) \ge kT \ge kT_{\max}(J)$. Because the efficiency is at most $1/k$ during the entire schedule,
we get $T(S) \ge kT_{\mathrm{eff}}(J)$.

The next claim is the key to the proof of Lemma 11.6. It states a purely
geometrical fact which is true for an arbitrary set of submeshes $\mathcal{D}$, but we
will only use it for $\mathcal{D}$ being a subset of running jobs.

Claim 11.7 Let $y$ and $v$ be given and let $\mathcal{D}$ be a set of submeshes with height
at most $y/v$. Then there exists a $\mathcal{D}' \subseteq \mathcal{D}$ such that $\mathrm{eff}(\mathcal{D}') \le 2/v$ and each
normal $1 \times y$-rectangle intersected by $\mathcal{D}$ is also intersected by $\mathcal{D}'$.

Proof. Let $R$ be a normal $n \times y$-rectangle. We will define $\mathcal{D}_R \subseteq \mathcal{D}$ such that
$\mathrm{eff}(\mathcal{D}_R) \le 2y/(vn)$ and it intersects all columns of $R$ intersected by $\mathcal{D}$. It is
then sufficient to set $\mathcal{D}' = \bigcup_R \mathcal{D}_R$, since every normal $1 \times y$-rectangle is a
column of some normal $n \times y$-rectangle. Because there are only $\lfloor n/y \rfloor$ normal
$n \times y$-rectangles, the efficiency of $\mathcal{D}'$ is $\mathrm{eff}(\mathcal{D}') \le \lfloor n/y \rfloor \cdot 2y/(vn) \le 2/v$.

$\mathcal{D}_R$ is obtained by a sweep over the mesh. Let $D_i \in \mathcal{D}$ be the submesh
which intersects the $i$th column and has the largest right coordinate of all such
submeshes ($D_i$ is undefined if no such submesh exists). Define a sequence
$\mathcal{D}_0 = \emptyset, \mathcal{D}_1, \dots, \mathcal{D}_n = \mathcal{D}_R$ by

$\mathcal{D}_i = \mathcal{D}_{i-1} \cup \{D_i\}$ if the $i$th column of $R$ intersects $\mathcal{D}$ but not $\mathcal{D}_{i-1}$,
$\mathcal{D}_i = \mathcal{D}_{i-1}$ otherwise.

From the way we choose $D_i$ it follows that no column of $R$ is intersected by
more than two submeshes of $\mathcal{D}_R$. Since the height of every submesh of $\mathcal{D}_R$
is at most $y/v$, we have $\mathrm{eff}(\mathcal{D}_R) \le 2y/(vn)$. □
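
The sweep can be written out as follows. This Python sketch is a simplified illustration: it assumes integer coordinates, represents a submesh as a tuple `(x0, y0, x1, y1)`, and treats column $i$ as the unit-width strip between $x = i$ and $x = i + 1$; the name `sweep` and this representation are ours.

```python
def sweep(rects, n, strip_bottom, strip_top):
    """Greedy sweep over one normal n-by-y strip R = [0, n] x [strip_bottom, strip_top].

    rects: list of (x0, y0, x1, y1) with x0 < x1 and y0 < y1 (integer coordinates).
    Returns the selected submeshes, in the order in which the sweep adds them.
    """
    # Only submeshes overlapping the strip with positive area can hit its columns.
    in_strip = [r for r in rects if r[1] < strip_top and r[3] > strip_bottom]
    selected = []
    for i in range(n):  # column i spans x in [i, i + 1]
        hits = [r for r in in_strip if r[0] <= i < r[2]]
        covered = any(r[0] <= i < r[2] for r in selected)
        if hits and not covered:
            # Among the submeshes hitting this column take the one reaching furthest right.
            selected.append(max(hits, key=lambda r: r[2]))
    return selected
```

Every column hit by some submesh of the input is hit by a selected one, and no column is hit by more than two selected submeshes, which is the property used to bound $\mathrm{eff}(\mathcal{D}_R)$.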

Proof of Lemma 11.6. Divide $\mathcal{C}(\bar{I'})$ into the following three parts:

• $\mathcal{C}_1$ contains all submeshes whose heights are at most $s^{a-1}$,

• $\mathcal{C}_2$ contains all submeshes whose heights as well as widths are at most
$n/s^{b+1}$, and

• $\mathcal{C}_3$ contains all submeshes whose widths are at most $s^{a-1}$.

All submeshes from $\mathcal{C}([0..(a-1)])$ are either in $\mathcal{C}_1$ or in $\mathcal{C}_3$ depending on
their orientation and all submeshes of $\mathcal{C}([(b+1)..t])$ are in $\mathcal{C}_2$, hence
$\mathcal{C}(\bar{I'}) = \mathcal{C}_1 \cup \mathcal{C}_2 \cup \mathcal{C}_3$.

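We apply Claim 11.7 four times to obtain the following sets:

• $\mathcal{D}_1 \subseteq \mathcal{C}_1$, which intersects all normal $1 \times \frac{s^a}{4k}$-rectangles intersected by $\mathcal{C}_1$;

• $\mathcal{D}_2 \subseteq \mathcal{C}_2 \cup \mathcal{C}_3$, which intersects all normal $\frac{n}{4ks^b} \times 1$-rectangles intersected
by $\mathcal{C}_2 \cup \mathcal{C}_3$;

• $\mathcal{D}_3 \subseteq \mathcal{C}_1 \cup \mathcal{C}_2$, which intersects all normal $1 \times \frac{n}{4ks^b}$-rectangles intersected
by $\mathcal{C}_1 \cup \mathcal{C}_2$;

• $\mathcal{D}_4 \subseteq \mathcal{C}_3$, which intersects all normal $\frac{s^a}{4k} \times 1$-rectangles intersected by $\mathcal{C}_3$.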




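Hence $\mathcal{D}_1 \cup \mathcal{D}_2$ intersects all normal $\frac{n}{4ks^b} \times \frac{s^a}{4k}$-rectangles intersected by $\mathcal{C}(\bar{I'})$
and $\mathcal{D}_3 \cup \mathcal{D}_4$ intersects all normal $\frac{s^a}{4k} \times \frac{n}{4ks^b}$-rectangles intersected by $\mathcal{C}(\bar{I'})$.

In all four cases we have $v = s/4k$. So setting $\mathcal{D} = \mathcal{D}_1 \cup \mathcal{D}_2 \cup \mathcal{D}_3 \cup \mathcal{D}_4$
gives us an efficiency $\mathrm{eff}(\mathcal{D}) \le 32k/s \le 1/8k$ for sufficiently large $n$. □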

The  next  claim  is  the  key  to  the  proof  of  the  lower  bound.  It  shows 
that  the  adversary  strategy  does  not  allow  the  scheduler  to  reuse  the  space 
efficiently. 

Claim 11.8 Let $j \in I$ and $R$ be a normal $(\frac{n}{4ks^j}, \frac{s^j}{4k})$-rectangle that does not
intersect $\mathcal{C}$ at this step of the schedule. Then during the current phase $R$ has
never intersected $\mathcal{C}$.

Proof. The DECREASE EFFICIENCY step is the only step removing jobs
during a phase, and hence it is sufficient to prove that no previous DECREASE
EFFICIENCY step removed all the jobs which intersected $R$.

Assume that the active interval before such a previous DECREASE EFFICIENCY
step is $[a..b]$. Since the active interval never grows, we have
$a \le j \le b$. Because the dimensions of $R$ are integer multiples of the dimensions
of the normal rectangles from Lemma 11.6 as used in that DECREASE
EFFICIENCY step, $R$ can be partitioned into a set $\mathcal{R}$ of such rectangles
without any leftover. According to the assumption, the rectangle $R$, and
hence some rectangle from $\mathcal{R}$, is intersected by $\mathcal{C}$ before the DECREASE
EFFICIENCY step. But then by Lemma 11.6 this rectangle, and hence also
$R$, is also intersected by $\mathcal{C}$ after the DECREASE EFFICIENCY step. □

Claim 11.9 Each phase can have at most $6k$ DECREASE EFFICIENCY
steps.

Proof. Suppose that the scheduler starts a job from $J_j$, $j \in I$, in a submesh
$C$. As this does not cause a SINGLE JOB step, the job is scheduled on
a submesh with dimensions at least $\frac{n}{ks^j}$ and $\frac{s^j}{k}$. Thus at least half of that
submesh consists of normal $(\frac{n}{4ks^j}, \frac{s^j}{4k})$-rectangles. It follows from the previous
claim that at least half of the area could not have been used so far during
the current phase.

The efficiency immediately before a DECREASE EFFICIENCY step is at
most $1/k + 1/n$ (since the threshold $1/k$ was reached by the last job started).
By the construction and Lemma 11.6, the efficiency after the DECREASE
EFFICIENCY step is at most $1/2k + 1/2n + 1/8k \le 2/3k$ for sufficiently
large $n$.

This implies that between two DECREASE EFFICIENCY steps the
scheduler has to schedule jobs of the current active interval which will increase
the efficiency by at least $1/3k$ using at least $1/6k$ of the area that was
not yet used in this phase. This can be done at most $6k$ times. □

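Theorem 11.10 The adversary strategy forces that for every schedule $S$
generated by the on-line scheduler

$T(S) \ge k \cdot \max(T_{\mathrm{eff}}(J), T_{\max}(J))$, where $k = \Theta(\sqrt{\log\log N})$ is the parameter from Section 11.3.2.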

Proof. We first prove that after $k$ phases the active interval $I$ is still
nonempty. Each DECREASE EFFICIENCY step halves $I$ while all other
steps leave it unchanged. In $k$ phases there are at most $6k^2 \le \frac{2}{3}\log\log N$
DECREASE EFFICIENCY steps. At the beginning the length of $I$ is
$t + 1 > \frac{1}{2}\log_s n \ge \frac{1}{2}\log_{(\log\log N)^2} n > 2^{\frac{2}{3}\log\log N}$ for sufficiently large $n$. So the
active interval cannot be empty after $k$ phases.

If $j$ is in the active interval at the end of the $k$th phase then it was in
the active interval during all $k$ phases and the adversary could remove the
jobs of $J_j$ only in the $k$ CLEAN UP steps. During one CLEAN UP step he
could remove at most $nk$ of such jobs (since each job has area at least $n/k$),
hence some of them are not finished before the end of the $k$th phase. Hence
$T(S) \ge kT \ge kT_{\max}(J)$.

During the entire schedule the efficiency was at most $1/k$, hence
$T(S) \ge kT_{\mathrm{eff}}(J)$. Therefore $T(S) \ge k \cdot \max(T_{\mathrm{eff}}(J), T_{\max}(J))$. □

This  together  with  Theorem  11.5  gives  the  following  theorem. 

Theorem 11.11 The competitive ratio of any on-line scheduling algorithm
for a mesh of $N$ processors is at least $\Omega(\sqrt{\log\log N})$. □

11.4 Randomized scheduling

In  this  section  we  give  a  constant  competitive  randomized  algorithm  for  two- 
dimensional  meshes.  The  competitive  ratio  is  28  if  the  dimensions  of  all  the 
jobs  and  of  the  machine  are  powers  of  two;  otherwise  it  can  be  bounded  by 
44. 

The  basic  idea  of  the  algorithm  is  the  following.  We  partition  the  jobs 
according  to  their  size.  For  each  size  we  schedule  a  random  sample  of  jobs 
and  estimate  the  total  work  of  the  jobs  of  that  size.  Then  we  partition  the 
mesh so that each job size is assigned an area proportional to the estimated
work of jobs of that size and schedule all jobs of a given size in the assigned
area. 

There  are  some  issues  we  have  to  deal  with.  We  need  to  have  an  estimate 
on  the  longest  running  time,  as  this  is  crucial  for  any  sampling.  To  solve 
this  problem,  we  assume  that  the  longest  job  is  at  most  twice  as  long  as  the 
longest  job  we  have  seen  so  far.  If  we  see  a  longer  job,  we  abort  our  current 
attempt  and  start  from  the  beginning  while  doubling  our  estimate.  We  bound 
the time of the schedule by a sum of a term proportional to the work done
and a term which is bounded by a constant multiple of our estimate of the
longest running time. If we sum the bounds for the parts of the schedule with
different estimates, the sum of the first terms is still proportional to the work,
while the second terms form a geometric sequence, so their sum is bounded by
a constant multiple of the longest running time and our doubling strategy
works.
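
The geometric-sum argument can be illustrated by a simplified model of the doubling strategy, in which the estimate starts at some value not exceeding the longest running time and is doubled until it is large enough. The function name and this simplification are ours; the actual algorithm restarts whenever it observes a job longer than the current estimate.

```python
def doubling_estimates(initial, longest):
    """Estimates tried by the doubling strategy until one is at least `longest`."""
    estimates = [initial]
    while estimates[-1] < longest:
        estimates.append(2 * estimates[-1])
    return estimates
```

Starting from 1 with a longest running time of 37, the estimates are 1, 2, 4, ..., 64. Their sum is less than twice the final estimate, and the final estimate is less than twice the longest running time, so the additive terms over all attempts add up to a constant multiple of the longest running time.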

Even if we have a correct bound on the longest running time, sampling is
not trivial. We have to guarantee that our sample is sufficiently good while
keeping the time required for sampling small. Sampling a fixed number or
fixed fraction of jobs does not work: for some size we can have many jobs
with small running time and a few very long jobs; in that case we are not
likely to see any long job and then we have no useful information about the
total work. Instead, we sample until we see jobs with total running time
exceeding some bound. Intuitively, if there are only two long jobs in the first
quarter of the jobs, it is likely that the number of long jobs is close to eight;
if there are two long jobs among the first four, probably about a half of the
jobs are long. Even though we have a much larger sample in the first case,
Hoeffding bounds guarantee that in both cases the probability of a wrong
estimate is about the same. Note that while sampling in this way, we may
schedule most or all jobs before the bound is reached.
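
The stopping rule can be sketched as follows. This is only an illustration of sampling until a time bound is exceeded: the function names are ours, the jobs of one size are represented by a list of their (hidden) running times, and the scaling estimator in `estimate_work` is a natural choice made for this example, not necessarily the estimator used by the algorithm.

```python
import random


def sample_until(times, bound, rng):
    """Run jobs in random order until their total running time exceeds `bound`.

    Returns (number of jobs sampled, their total running time).
    """
    order = list(times)
    rng.shuffle(order)
    total, seen = 0, 0
    for t in order:
        total += t
        seen += 1
        if total > bound:
            break
    return seen, total


def estimate_work(times, bound, rng):
    """Scale the sampled running time up to the whole set of jobs (assumed estimator)."""
    seen, total = sample_until(times, bound, rng)
    return total * len(times) / seen
```

If all jobs are sampled before the bound is reached, the estimate equals the true total, which matches the remark that sampling may already schedule most or all jobs.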

In addition, it is necessary to guarantee that we get good estimates for all
the different sizes of jobs at once. This is important and non-trivial, since the
number of different sizes is not constant. To achieve this, we partition the
mesh for sampling so that we schedule in parallel a larger number of smaller
jobs. Therefore we have a larger and more reliable sample for smaller jobs,
and the total probability of an error can be bounded.

11.4.1  The  algorithm 

To make our exposition simpler and clearer, we assume that the mesh is
a square, its size is a power of 2, and the sizes of each job are powers of
2. Since this is often true in practice, and the competitive ratio is somewhat
better, this simpler case can be of independent interest. These assumptions
can be eliminated in the following way. We can round the sizes of jobs to the
next larger power of two and use a submesh whose sizes are multiples of the
modified sizes of all jobs (processing the large jobs separately); this changes
the efficiency and the competitive ratio by a constant factor. It is also not
difficult to modify the algorithm to handle non-square meshes.
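The rounding step can be sketched as follows. The helper names are hypothetical; the sketch only shows that rounding each side up to a power of two less than doubles it, so the area of a two-dimensional job grows by a factor below 4.

```python
def next_power_of_two(x: int) -> int:
    """Smallest power of two that is at least x (x must be positive)."""
    if x < 1:
        raise ValueError("size must be a positive integer")
    return 1 << (x - 1).bit_length()


def round_job(a: int, b: int) -> tuple:
    """Round both sides of an a x b job up to powers of two."""
    return next_power_of_two(a), next_power_of_two(b)


a, b = 5, 3
ra, rb = round_job(a, b)
print((ra, rb), ra * rb / (a * b))  # area blow-up stays below 4
```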

Let $m_l = n/2^l$. We define the job classes (as in Section 11.2) by
$\mathcal{J}^{(l)} = \{(a_i, b_i, t_i) \in \mathcal{J} \mid m_{l+1} < a_i \le m_l\}$, and subclasses by
$\mathcal{J}^{(l,l')} = \{(a_i, b_i, t_i) \in \mathcal{J}^{(l)} \mid m_{l+l'+1} < b_i \le m_{l+l'}\}$.
Note that under our assumptions the size of all
jobs in $\mathcal{J}^{(l,l')}$ is exactly $m_l \times m_{l+l'}$. Unless we say otherwise, $l$ and $l'$ range
over $3 \le l \le \log n$, $0 \le l' \le \log n$. Note that all the subclasses with $l' = 0$
contain jobs requiring square meshes, for $l' = 1$ meshes with $2:1$ ratio of
their dimensions, etc.
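Under the power-of-two assumptions, the indices of a job's subclass can be read off directly from its sides. The sketch below assumes a job of size a × b with n ≥ a ≥ b, all powers of two, so that a = n/2^l and b = n/2^(l+l'); the function name is hypothetical.

```python
def job_class(n: int, a: int, b: int) -> tuple:
    """Return (l, l_prime) with a == n >> l and b == n >> (l + l_prime)."""
    for x in (n, a, b):
        if x < 1 or x & (x - 1):
            raise ValueError("all sizes must be powers of two")
    if not (n >= a >= b):
        raise ValueError("expected n >= a >= b")
    l = (n // a).bit_length() - 1        # log2(n / a)
    l_prime = (a // b).bit_length() - 1  # log2(a / b)
    return l, l_prime


print(job_class(64, 8, 2))  # an 8 x 2 job on a 64 x 64 mesh
```

Square jobs give `l_prime == 0`, and jobs with a 2:1 aspect ratio give `l_prime == 1`, matching the remark above.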

In step (1) of the algorithm we schedule the large jobs (the three job
classes $\mathcal{J}^{(0)}$, $\mathcal{J}^{(1)}$ and $\mathcal{J}^{(2)}$). Steps (2) and (3) implement the doubling
strategy to estimate the longest running time. At any point the estimate is
$2^i\tau$.

Step (4) implements the sampling. It uses a fixed partition of the mesh
illustrated by Figure 5. To each subclass $\mathcal{J}^{(l,l')}$, $l \ge 3$, $l' \ge 0$, we assign a
submesh $G_{l,l'}$ of size $(4m_l) \times ((l'+1)m_{l'}/4)$. Each $G_{l,l'}$ is divided into $(l'+1)2^l$
submeshes $G_{l,l',j}$ of size $m_l \times m_{l+l'}$. Hence $(l'+1)2^l$ jobs of each subclass
can be scheduled in parallel. We need to verify that these submeshes can be
placed onto an $n \times n$ mesh so that they are pairwise disjoint. For a fixed $l$, the
sum of the heights of the grids $G_{l,l'}$ is bounded by $\sum_{l' \ge 0} (l'+1)n/2^{l'+2} = n$,
therefore they all fit into a submesh of size $4m_l \times n$. The total width of these
meshes for $l \ge 3$ is bounded by $n$, hence all the meshes fit into the $n \times n$
mesh.
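The packing argument can be checked numerically. The sketch below assumes that the submesh for subclass (l, l') is 4·m_l wide and (l'+1)·m_{l'}/4 high, with m_l = n/2^l, and that only subclasses with l + l' ≤ log n occur; the function name is hypothetical.

```python
def sampling_layout(n: int) -> tuple:
    """Return (total_width, tallest_column) of the sampling partition
    for an n x n mesh, n a power of two with n >= 8."""
    log_n = n.bit_length() - 1
    total_width = 0
    tallest = 0
    for l in range(3, log_n + 1):
        m_l = n >> l
        column_height = 0
        for lp in range(0, log_n - l + 1):
            height = (lp + 1) * (n >> lp) // 4        # (l'+1) * m_{l'} / 4
            cell_area = m_l * (n >> (l + lp))         # m_l x m_{l+l'}
            cells = (4 * m_l * height) // cell_area
            assert cells == (lp + 1) * 2 ** l         # jobs run in parallel
            column_height += height
        tallest = max(tallest, column_height)
        total_width += 4 * m_l
    return total_width, tallest


print(sampling_layout(64))
```

Both returned values stay at most n, so the columns for l = 3, 4, 5, ... can be placed side by side, each stacking its submeshes for l' = 0, 1, 2, ... vertically.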


[Figure 5 placeholder: the columns of the partition correspond to $l = 3, 4, 5, \ldots$; within each column the rows correspond to $l' = 0, 1, 2, \ldots$]

Figure 5: The partition of the mesh used for sampling in the algorithm SAMPLE.
(The thick lines partition the mesh into submeshes $G_{l,l'}$, the thin lines
partition them further into submeshes $G_{l,l',j}$.)

Step (5) computes the estimates based on previous sampling. For each $l$
and $l'$, $w_{l,l'}$ estimates the total running time of all jobs in $\mathcal{J}^{(l,l')}$ not scheduled
before the step (4). Note that this also estimates the work (after rescaling),
as the size of all jobs in $\mathcal{J}^{(l,l')}$ is the same. $w_l$ estimates the total work of
jobs in $\mathcal{J}^{(l)}$ and $w$ estimates the total work of all unscheduled jobs. $p_l$ is
a number of processors proportional to $w_l$, and $z_l$ is the number of grids of size
$m_l \times m_l$ actually assigned to the class $\mathcal{J}^{(l)}$ (this rounding guarantees that each
job fits into the assigned mesh).

Steps (6) and (7) schedule the jobs in areas approximately proportional to
the estimated work in each class, using a slight modification of the methods
from Section 11.2. The condition in (7) and the step (8) guarantee that we
sample again if some of the estimates turn out to be wrong.

Algorithm SAMPLE

(1) Schedule the classes $\mathcal{J}^{(0)}$, $\mathcal{J}^{(1)}$ and $\mathcal{J}^{(2)}$ using ORDERED;
    while no job of nonzero time was scheduled do
        schedule one job;
        wait;
    let $\tau$ be the maximal running time of the jobs scheduled so far;
    let $i := 1$;

(2) During the steps (3) to (7),
    if the running time of any job exceeds $2^i\tau$,
    then begin let $i := i + 1$; goto (3) end;

(3) wait;

(4) For every $l, l'$, let $r_{l,l'}$ be the number of unscheduled jobs in $\mathcal{J}^{(l,l')}$.
    For time $2 \cdot 2^i\tau$, do
        if $G_{l,l',j}$ is empty and there is an unscheduled job in $\mathcal{J}^{(l,l')}$
        then schedule a random unscheduled job in $\mathcal{J}^{(l,l')}$ onto $G_{l,l',j}$;
    wait;

(5) for every $l, l'$, if there are unscheduled jobs in $\mathcal{J}^{(l,l')}$
    then begin
        let $k_{l,l'}$ be the smallest $k$ such that the sum of the running times
            of the first $k$ jobs from $\mathcal{J}^{(l,l')}$ scheduled during the preceding
            step (4) is at least $2(l'+1)2^l 2^i\tau$;
        let $w_{l,l'} := 4 r_{l,l'} (l'+1) 2^l 2^i\tau / k_{l,l'}$;
    end;
    else let $w_{l,l'}$ be the total running time of jobs from $\mathcal{J}^{(l,l')}$ scheduled
        during the preceding step (4);
    for every $l$, let $w_l := m_l \sum_{l'} m_{l+l'} w_{l,l'}$;
    let $w := \sum_l w_l$;
    for every $l$,
        let $p_l := \frac{47}{48} n^2 w_l / w$;
        let $z_l := \lceil p_l / m_l^2 \rceil$;

(6) Partition the machine into square submeshes $H_{l,j}$, $l \ge 3$, $1 \le j \le z_l$,
    of size $m_l \times m_l$;

(7) in parallel for each $l$ schedule the class $\mathcal{J}^{(l)}$ on the collection of grids
    $\{H_{l,1}, \ldots, H_{l,z_l}\}$ using ORDERED, provided that
    if ((all jobs of some class $\mathcal{J}^{(l)}$ are finished and the total work of the
        jobs scheduled during this step is less than $w_l/4$)
        or the time spent in this step is $\frac{48}{47} w/n^2$),
    then interrupt all instances of ORDERED;

(8) wait;
    if there are unscheduled jobs, then goto (4).

We need to justify the step (6). As in Section 11.2, it is sufficient to show
that the total area of all the submeshes $H_{l,j}$ is not too large, since the meshes
are squares whose sizes are powers of two. The area is bounded by

$$\sum_{l \ge 3} z_l m_l^2 \le \sum_{l \ge 3} (p_l/m_l^2 + 1)\, m_l^2 \le \sum_{l \ge 3} p_l + \sum_{l \ge 3} m_l^2 = n^2.$$
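The area bound can be checked with exact arithmetic. The sketch assumes that the processors are shared out as p_l = share · n² · w_l / w with share = 47/48, which is the largest fraction for which the bound closes because the squares m_l² for l ≥ 3 sum to less than n²/48; the function name and the default are assumptions made for illustration.

```python
from fractions import Fraction
from math import ceil


def assigned_area(n, work, share=Fraction(47, 48)):
    """Total area of the square submeshes handed out in step (6).

    `work` maps a class index l >= 3 to its estimated work w_l.
    """
    total = sum(work.values())
    area = 0
    for l, w_l in work.items():
        m_l = n >> l                                  # side of the squares
        p_l = share * n * n * Fraction(w_l) / total   # proportional share
        z_l = ceil(p_l / (m_l * m_l))                 # rounded-up count
        area += z_l * m_l * m_l
    return area


estimates = {3: 5, 4: 1, 5: 7, 6: 2}
print(assigned_area(64, estimates), 64 * 64)
```

Rounding up costs at most one extra square per class, which is what the second sum in the displayed inequality accounts for.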

11.4.2  Probability  estimates 

Our basic tools are Chernoff-Hoeffding bounds, which bound the probability
that a sum of random variables differs significantly from its mean [Hoe63,
HR90, ASE92]. Many sources state them only for a random variable $S$ which
is a sum of independent 0-1 variables $X_i$. However, the original Hoeffding
paper [Hoe63] proves that the same bounds are true even for more general
variables $X_i$. First, it is sufficient if the values of each $X_i$ are from the
real interval $[0,1]$, not necessarily integers. Second, the variables can be
produced by sampling without replacement from some universe, in which
case the variables are not independent; independent variables correspond to
sampling with replacement. Intuitively, both of these changes should keep the
value of $S$ even closer to its mean. Note that in an extreme case of sampling
without replacement we obtain all elements of the universe, in which case we
know the sum exactly.

The most convenient form of the Hoeffding bounds states that for the
appropriate $S$ and any $0 < \varepsilon \le 1$ the following holds:

$$\mathrm{Prob}[S \le (1-\varepsilon)\,\mathrm{E}[S]] \le e^{-\varepsilon^2 \mathrm{E}[S]/2},$$
$$\mathrm{Prob}[S \ge (1+\varepsilon)\,\mathrm{E}[S]] \le e^{-\varepsilon^2 \mathrm{E}[S]/3}.$$

An elegant derivation for independent 0-1 variables can be found in [HR90];
to modify it for the more general $S$ it is necessary to use convexity of the
exponential function and Jensen's inequality during the proof, see [Hoe63].

Sampling  without  replacement  corresponds  to  the  process  by  which  we 
schedule  the  jobs  within  one  subclass  during  the  step  (4)  of  the  algorithm 
and  hence  the  bounds  apply  in  our  case.  However,  we  need  the  following 
variation  in  which  we  stop  sampling  after  the  sum  of  samples  achieves  a 
certain  threshold.  The  threshold  has  to  be  smaller  than  the  total  sum,  so 
that  we  do  not  run  out  of  samples. 

Lemma 11.12 Let $U$ be a set of $r$ real numbers from $[0,1]$ with sum $W$. Let
$X_1, \ldots, X_r$ be a sequence of random variables obtained by sampling without
replacement out of the set $U$. Let $S_i$ be the sum of the first $i$ of them. Let
$5 \le \lambda \le W$ be given. Let $k$ be the unique integer such that $S_{k-1} < \lambda \le S_k$
(note that $k$ is a random variable). Then

$$\mathrm{Prob}[\neg(r\lambda/2k \le W \le 2r\lambda/k)] \le 2e^{-\lambda/6}.$$

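Proof. Let $\alpha = r\lambda/W$; intuitively $\alpha$ is the expected value of $k$, i.e., the
expected number of samples after which the threshold $\lambda$ is reached. Note
that $\mathrm{E}[S_i] = iW/r = i\lambda/\alpha$ for all $i \le r$.

If $W < r\lambda/2k$ then $k < \alpha/2$ by the definition of $\alpha$. From the definition
of $k$ it follows that for $\beta = \lfloor \alpha/2 \rfloor$, $S_\beta \ge \lambda = \frac{\alpha}{\beta}\mathrm{E}[S_\beta]$. Using the Hoeffding bound
we get

$$\mathrm{Prob}[k < \alpha/2] \le \mathrm{Prob}[S_\beta \ge \tfrac{\alpha}{\beta}\mathrm{E}[S_\beta]] \le e^{-(\frac{\alpha}{\beta}-1)^2 \mathrm{E}[S_\beta]/3} \le e^{-\lambda/6},$$

since the exponent satisfies $(\frac{\alpha}{\beta}-1)^2\mathrm{E}[S_\beta] = (\frac{\alpha}{\beta}-1)^2 \frac{\beta}{\alpha}\lambda = (\frac{\alpha}{\beta} + \frac{\beta}{\alpha} - 2)\lambda \ge \lambda/2$;
the last inequality uses $\frac{\beta}{\alpha} \le 1/2$ and the fact that $x + \frac{1}{x}$ decreases for $x < 1$.

If $W > 2r\lambda/k$ then $k > 2\alpha$. We put $\beta = \lfloor 2\alpha \rfloor$. Note that $\frac{\beta}{\alpha} \ge 2 - \frac{1}{\alpha} \ge
2 - \frac{1}{\lambda} \ge \frac{9}{5}$. By the definition of $k$, $S_\beta < \lambda = \frac{\alpha}{\beta}\mathrm{E}[S_\beta]$. Using the Hoeffding bound,

$$\mathrm{Prob}[k > 2\alpha] \le \mathrm{Prob}[S_\beta < \tfrac{\alpha}{\beta}\mathrm{E}[S_\beta]] \le e^{-(1-\frac{\alpha}{\beta})^2 \mathrm{E}[S_\beta]/2} \le e^{-\lambda/6},$$

since $(1-\frac{\alpha}{\beta})^2\mathrm{E}[S_\beta] = (1-\frac{\alpha}{\beta})^2 \frac{\beta}{\alpha}\lambda = (\frac{\alpha}{\beta} + \frac{\beta}{\alpha} - 2)\lambda \ge \lambda/3$; the last inequality
uses $\frac{\beta}{\alpha} \ge 9/5$ and the fact that $x + \frac{1}{x}$ increases for $x > 1$. □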
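The two numeric facts used at the end of each case can be verified exactly. The sketch evaluates x + 1/x - 2 at the boundary values x = 1/2 and x = 9/5; the helper name is hypothetical, and the reading of the garbled exponents as (x + 1/x - 2)·λ is an assumption based on the surrounding derivation.

```python
from fractions import Fraction


def exponent_factor(x):
    """Value of x + 1/x - 2, the factor multiplying lambda in the exponent."""
    return x + 1 / x - 2


low = exponent_factor(Fraction(1, 2))   # first case, x <= 1/2
high = exponent_factor(Fraction(9, 5))  # second case, x >= 9/5
print(low, high)
```

Dividing the first factor by 3 and the second by 2, as the two Hoeffding bounds require, gives an exponent of at least λ/6 in both cases.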

11.4.3  Expected  time  analysis 

First we analyze one pass through the steps (4) to (8). Let us introduce some
notation. Let $W_{l,l'}$ be the total running time of all unscheduled jobs in the
subclass $\mathcal{J}^{(l,l')}$ before the step (4), $W_l = m_l \sum_{l'} m_{l+l'} W_{l,l'}$ the total work of
all unscheduled jobs in the class $\mathcal{J}^{(l)}$ and $W = \sum_l W_l$ the total work of all
unscheduled jobs. Let $W'_{l,l'}$, $W'_l$ and $W'$ be the same quantities restricted to
the jobs actually scheduled during the steps (4) to (8). Note that $w_{l,l'}$, $w_l$
and $w$ are estimates of $W_{l,l'}$, $W_l$ and $W$.

The  first  claim  says  that  with  large  probability,  after  step  (6)  all  the 
estimates  are  sufficiently  good. 

Claim 11.13 If no running time is larger than $2^i\tau$, then after step (6),
$\mathrm{Prob}[\neg((\forall l, l')\; w_{l,l'}/4 \le W_{l,l'} \le w_{l,l'})] \le 1/6$, and therefore
$\mathrm{Prob}[\neg((\forall l)\; w_l/4 \le W_l \le w_l)] \le 1/6$.

Proof. First we argue that for any given $l$ and $l'$,

$$\mathrm{Prob}[\neg(w_{l,l'}/4 \le W_{l,l'} \le w_{l,l'})] \le 2\alpha^{(l'+1)2^{l-3}},$$

where $\alpha = e^{-8/3} \approx 0.0695$.

If $W_{l,l'} < 2(l'+1)2^l 2^i\tau$ then $w_{l,l'} = W_{l,l'}$ by the definition of $w_{l,l'}$.
Otherwise the total running time of jobs from $\mathcal{J}^{(l,l')}$ is at least
$2(l'+1)2^l 2^i\tau$. Hence if we normalize the running times of scheduled jobs and
$W = W_{l,l'}$ by dividing them by $2^i\tau$, set $r = r_{l,l'}$, $\lambda = 2(l'+1)2^l$ and $k = k_{l,l'}$,
we get exactly the situation described in the assumptions of Lemma 11.12.
The statement follows from Lemma 11.12 by definition of $w_{l,l'} = 2 r_{l,l'} \lambda 2^i\tau / k_{l,l'}$.
Now, we sum over all $l$ and $l'$:

$$\sum_{l \ge 3} \sum_{l' \ge 0} 2\alpha^{(l'+1)2^{l-3}} = \sum_{l \ge 3} \frac{2\alpha^{2^{l-3}}}{1 - \alpha^{2^{l-3}}} \le \frac{2\alpha}{1-\alpha} + \frac{2\alpha^2}{(1-\alpha^2)^2} \le \frac{1}{6}.$$

The statement for $w_l$ is a trivial consequence. □
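The total failure probability can be evaluated numerically. The sketch assumes that each subclass (l, l') fails with probability at most 2e^{-λ/6} for λ = 2(l'+1)2^l, as Lemma 11.12 and the threshold in step (5) suggest, which equals 2α^((l'+1)·2^(l-3)) for α = e^{-8/3}; the truncation limits and the function name are arbitrary choices made for illustration.

```python
import math


def total_failure_bound(max_l=20, max_lp=200):
    """Sum of the per-subclass failure bounds over l >= 3 and l' >= 0,
    truncated where the terms are far below floating-point precision."""
    alpha = math.exp(-8 / 3)
    total = 0.0
    for l in range(3, max_l + 1):
        for lp in range(0, max_lp + 1):
            total += 2 * alpha ** ((lp + 1) * 2 ** (l - 3))
    return total


print(total_failure_bound(), 1 / 6)
```

Almost all of the sum comes from the classes l = 3 and l = 4; the doubly exponential decay in l makes the remaining classes negligible.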


Now  we  want  to  prove  that  the  algorithm  is  efficient,  and  if  the  estimates 
are  good,  the  step  (7)  schedules  all  the  jobs.  The  step  (7)  is  the  only  one 
that  could  be  inefficient,  as  the  length  of  all  the  other  steps  is  bounded  by  a 
constant  multiple  of  the  current  estimate  of  the  longest  running  time.  Since 
the  mesh  is  divided  proportionally  to  the  estimated  work,  we  expect  that 
all  the  classes  are  finished  at  about  the  same  time.  The  time  bound  in  step 
(7)  is  chosen  so  that  if  for  no  class  the  estimate  of  work  is  too  small,  all 
jobs  are  finished.  If  the  estimate  is  not  too  large,  then  the  average  efficiency 
in  that  class  is  at  least  1/4;  otherwise  the  condition  in  step  (7)  interrupts 
immediately.  From  this  it  follows  that  the  average  efficiency  is  sufficiently 
large  even  if  we  average  over  all  job  classes. 

However,  since  different  classes  can  finish  at  different  times  (within  a 
factor  of  4),  the  exact  statement  is  somewhat  tedious.  One  technical  issue  is 
that  our  estimates  include  the  jobs  that  have  already  been  scheduled  during 
the  sampling  step.  However,  since  the  length  of  the  sampling  step  is  bounded 
by  a  multiple  of  the  longest  running  time,  we  can  “credit”  this  work  to  step 
(7).  We  solve  this  formally  as  in  Section  11.2.  Instead  of  literally  proving  that 
the  time  when  the  efficiency  is  low  is  short,  we  again  prove  that  the  length 
of  the  schedule  is  bounded  by  a  constant  multiple  of  the  work  divided  by  the 
number of processors, plus an additive term bounded by a constant multiple
of the longest running time. The next claim analyzes one pass through the
steps (4) to (8) in this way and also proves that if the estimates from the
sampling are good, the step (7) schedules all jobs or ends by finding a job
longer than $2^i\tau$.

Claim 11.14 (i) The total time $T$ of a pass through steps (4) to (8) is
bounded by $T \le (4 + \frac{4}{47})\, W'/n^2 + 4 \cdot 2^i\tau$.

(ii) If $w_l/4 \le W_l \le w_l$ for all $l$, then the pass through steps (4) to (8)
either ends by invoking the condition in step (2), or schedules all jobs.

Proof. Since the sizes of all jobs are powers of two, the efficiency of each
instance of ORDERED is 1 as long as there are jobs available in that class.

(i) Let $T'$ be the length of the step (7). We prove that for every $l$,

$$W'_l \ge \tfrac{1}{4}\, p_l\, (T' - 2^i\tau). \qquad (1)$$

If $W'_l \ge w_l/4$, then $W'_l \ge \frac{1}{4} p_l T'$, since $T' \le w_l/p_l$ by the definition of $p_l$ and
the condition that bounds the time in step (7). The condition (1) follows.

If $W'_l < w_l/4$ then either not all jobs of $\mathcal{J}^{(l)}$ are finished at the end of
step (7), or the step ended because the last job of $\mathcal{J}^{(l)}$ just finished and
$W'_l < w_l/4$. In both cases at time $2^i\tau$ before the end of the step there were
unscheduled jobs in $\mathcal{J}^{(l)}$, otherwise some job is running for $2^i\tau$ and the
step is interrupted by the condition in step (2). Therefore for time at least
$T' - 2^i\tau$ the efficiency of the corresponding instance of ORDERED is 1 and
the work done is at least $p_l(T' - 2^i\tau)$, which proves the condition (1).

The condition (1) implies that $W' = \sum_l W'_l \ge \sum_l \frac{1}{4} p_l (T' - 2^i\tau) =
\frac{47}{4 \cdot 48} n^2 (T' - 2^i\tau)$, and therefore $T' \le (4 + \frac{4}{47})\, W'/n^2 + 2^i\tau$. The bound on $T$
now follows, since the length of steps (4) and (8) is bounded by $3 \cdot 2^i\tau$ and
no time is spent in steps (5) and (6).

(ii) Suppose for a contradiction that $W_l \ge w_l/4$ for all $l$, the step (7) is
not interrupted by the condition in (2), and does not schedule all jobs from
some $\mathcal{J}^{(l)}$. In that case the step (7) takes time $\frac{48}{47} w/n^2$. As the area assigned
to $\mathcal{J}^{(l)}$ is at least $\frac{47}{48} n^2 w_l/w$ and the efficiency is 1, the total work of jobs from
$\mathcal{J}^{(l)}$ scheduled during (7) is at least $w_l$. If $w_l \ge W_l$, then this means that
all jobs from $\mathcal{J}^{(l)}$ have been scheduled (considering that some positive work
was also done during (4), to break the equality); a contradiction. □

Now  it  is  easy  to  prove  the  main  theorem.  We  only  need  to  take  care  of  the 
fact  that  the  sampling  step  can  be  repeated  several  times  and  of  the  doubling 
of  the  estimate  on  the  longest  running  time.  One  subtle  observation  is  that 
if  the  running  time  is  larger  than  the  current  estimate,  then  all  the  previous 
claims  still  work  until  we  actually  see  a  long  job.  This  is  justified  by  the 
following  mental  experiment:  replace  all  the  long  jobs  by  jobs  whose  length 
is  equal  to  the  current  estimate;  until  the  moment  we  see  a  job  running 
longer  than  the  estimate,  the  algorithm  behaves  exactly  the  same  way  on 
both  instances,  and  hence  all  the  bounds  must  be  true  as  well. 

Theorem  11.15  The  algorithm  SAMPLE  is  28-competitive. 
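The arithmetic behind the constant 28 can be sanity-checked with exact fractions. The sketch uses only quantities stated in the surrounding text: a pass is repeated with probability less than 1/6, each pass contributes an additive 4·2^iτ, step (3) contributes another 2^iτ, and the sum of the estimates 2^iτ is at most 4 times the optimum. The helper names are hypothetical.

```python
from fractions import Fraction


def expected_passes(p_repeat):
    """Expected number of passes when each pass is repeated
    independently with probability p_repeat (geometric series)."""
    return 1 / (1 - p_repeat)


def additive_coefficient(p_repeat):
    """Coefficient of 2^i * tau: one unit for step (3) plus four units
    per pass through steps (4) to (8)."""
    return 1 + 4 * expected_passes(p_repeat)


p = Fraction(1, 6)
coeff = additive_coefficient(p)
# Room left for the coefficient of the work term W/n^2:
slack = 28 - 4 * coeff
print(expected_passes(p), coeff, slack)
```

The remaining slack of 24/5 shows that any work coefficient of the form 4 plus a small fraction fits under the stated competitive ratio.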

Proof. First we bound the total length $T_i$ of steps (3) to (8) during which
the doubling counter has a fixed value $i$. Let $W''_i$ denote the total work done
during that part of the schedule. Step (3) takes time at most $2^i\tau$. According to
Claim 11.13, if no job longer than $2^i\tau$ is found, the probability that the steps
(4) to (8) are repeated is less than 1/6. Therefore the expected number of passes
through steps (4) to (8) is at most 6/5 and hence using Claim 11.14 we get

$$\mathrm{E}[T_i] \le 2^i\tau + (4 + \tfrac{4}{47})\, W''_i/n^2 + \tfrac{6}{5} \cdot 4 \cdot 2^i\tau.$$

Let $W''$ be the total work of all jobs and let $i'$ be the maximal value of the
counter during the algorithm. Obviously $2^{i'}\tau \le 2T_{max}$. To bound the length of
step (1), we use the fact that during step (1) the efficiency is less than 1 for at
most $3\tau$. Hence the total length of the schedule $T(\mathcal{J})$ is bounded by

$$\mathrm{E}[T(\mathcal{J})] \le (4 + \tfrac{4}{47})\, W''/n^2 + 3\tau + (6 - \tfrac{1}{5}) \sum_{i=1}^{i'} 2^i\tau$$
$$\le (4 + \tfrac{4}{47})\, W''/n^2 + (6 - \tfrac{1}{5})\, 2^{i'+1}\tau$$
$$\le (4 + \tfrac{4}{47})\, T_{opt}(\mathcal{J}) + 4\,(6 - \tfrac{1}{5})\, T_{opt}(\mathcal{J}) \le 28\, T_{opt}(\mathcal{J}).$$


12  Higher-dimensional  meshes 

In this section we generalize our results from Section 11 to higher-dimensional
meshes. Obviously, the lower bound from Section 11.3 is true also for higher
dimensions, because we can use the two-dimensional proof modified so that
the additional dimensions of all jobs are equal to the dimensions of the machine.
Generalization of our algorithms from Sections 11.1, 11.2 and 11.4
requires some work. In all three cases the ideas behind the algorithms do
not change; however, proving that they still work requires some tedious
calculations.

The main result is that the competitive ratio of the generalized randomized
algorithm SAMPLE is a constant that does not depend even on the
dimension $d$, if the dimensions of all jobs and of the machine are powers
of two, and there are no large jobs (i.e., no dimension of any job is more than
half of the corresponding dimension of the mesh). If the sizes are powers of
two, but large jobs are allowed, the competitive ratio is $O(d)$. For general
jobs, the competitive ratio is $O(4^d)$.

The competitive ratio of the generalized deterministic algorithm is
$O(\sqrt{\log\log N})$ for a constant dimension $d$, which is optimal up to a constant
factor. However, the competitive ratio depends on $d$.

12.1  Deterministic  scheduling 

Our algorithm for deterministic scheduling on d-dimensional meshes is
essentially a generalization of the algorithm BALANCED PARALLEL for
two-dimensional meshes from Section 11.1. The main problem is that the number
of job classes and in particular the number of classes of large jobs is
exponential in $d$.

We assume without loss of generality that the dimensions of each job
$(a_1, \ldots, a_d)$ satisfy $a_1 \ge a_2 \ge \ldots \ge a_d$. For simplicity we suppose first that
the dimensions of all jobs and of the machine are powers of two and all the
dimensions of the machine are the same, $n$. We say that two jobs are in the
same job class, if all the coordinates except for the last one are identical.
The number of different classes is $(\log n)^{d-1}$.

Now it is possible to generalize algorithms from the previous section in a
straightforward way. CLASS schedules all jobs in a single class using the
one-dimensional algorithm ORDERED disregarding all but the last dimension.
SERIAL schedules all classes sequentially. PARALLEL and BALANCED
PARALLEL divide the mesh into submeshes with the last coordinate smaller,
and then proceed similarly to the two-dimensional case.

The d-dimensional algorithm runs in three phases. In the first phase
a d-dimensional version of BALANCED PARALLEL is used to schedule
jobs whose last dimension is small. The second phase repeatedly uses a
d-dimensional version of PARALLEL to schedule jobs whose last dimensions
get larger and larger. Finally, the third phase uses a d-dimensional version
of SERIAL, when PARALLEL stops making progress.

We use BALANCED PARALLEL with $k = \sqrt{\log\log n}$ for all jobs with
the last dimension $a_d \le n/(k \cdot \mathrm{order}(\mathcal{J})) \le n/(\log n)^d$. This achieves
competitive ratio of $\log(\mathrm{order}(\mathcal{J}))/k = d\sqrt{\log\log n}$.

The number of remaining job classes is bounded by $(\log((\log n)^d))^d =
(d\log\log n)^d$. Hence just applying PARALLEL once and then SERIAL
leads to an $f(n)^d$ term in the competitive ratio for some non-constant
function $f(n)$, which would mean that the algorithm is not optimal even
for constant $d$. To avoid that, we apply PARALLEL recursively. In every
step of PARALLEL the number of job classes is reduced by an application
of logarithm. This implies that after $i$ iterations of PARALLEL the
number of remaining job classes is bounded by $(d(2\log d + \log^{(i+2)} n))^d$ and
after $\log^* n$ iterations it is bounded by $(2d\log d)^d$. After this we finish the
schedule using SERIAL for these job classes. This results in a competitive
ratio of $O(d\sqrt{\log\log n} + d\log d\log^* n + d\log^{(3)} n + (2d\log d)^d) =
O(d\sqrt{\log\log n} + d\log d\log^* n + (2d\log d)^d)$.

If the dimensions are not powers of two, the efficiency decreases by a
factor of $2^d$. However, efficiency is an additive term in the competitive ratio,
and hence it is absorbed in the last term in our case. We have proved the
following theorem.

Theorem 12.1 There exists an on-line algorithm for scheduling on
d-dimensional meshes of $N$ processors which is $O(d\sqrt{\log\log N} + d\log d\log^* N +
(d\log d)^d)$-competitive. If $d$ is a constant, the algorithm is optimal up to a
constant factor. □
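The number of recursive applications of PARALLEL is governed by the iterated logarithm. A minimal sketch of log* follows, assuming base-2 logarithms and the convention that log* counts how many times the logarithm must be applied before the value drops to at most 1; the function name is hypothetical.

```python
import math


def log_star(n: float) -> int:
    """Number of times log2 must be applied to n to reach a value <= 1."""
    count = 0
    x = float(n)
    while x > 1:
        x = math.log2(x)
        count += 1
    return count


for n in (2, 16, 65536, 2 ** 64):
    print(n, log_star(n))
```

Even for astronomically large meshes the value stays in the single digits, which is why the d log d log* n term is dominated by the other terms for constant d.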

12.2  Off-line  scheduling 

In this section we bound $T_{opt}$ by a multiple of $\max(T_{eff}, T_{max})$, which is a
generalization of the off-line algorithm for two-dimensional meshes from
Section 11.2. At the same time this generalizes a part of the randomized
algorithm SAMPLE from Section 11.4, namely the steps (5) and (7) in which
we schedule the jobs based on the previous estimates of the work of each
job class. This is the main application of this section: as opposed to the
two-dimensional case, we do not need this result for the deterministic lower
bound.

We can obtain a constant factor independent of $d$ if the dimensions of all
jobs and of the machine are powers of two, and there are no large jobs, i.e.,
each dimension of a job is smaller than the corresponding dimension of the
mesh.

If the machine is a cube, i.e., all the dimensions of the mesh are equal,
we can process the large jobs with a loss of only a factor $d$ as follows. We
partition them according to the number $i$ of dimensions that are the same as the
corresponding dimension of the mesh. For each $i$ we process them as
$(d-i)$-dimensional jobs, disregarding the first $i$ dimensions. If the machine is not a
cube, the same process loses a factor up to $2^d$. If the dimensions of the jobs
and of the mesh are not powers of two, the efficiency decreases by a factor
of $4^d$.




In the rest of this section we demonstrate how to produce a schedule
with a length within a constant factor of $\max(T_{eff}, T_{max})$, assuming that the
machine is a cube, the dimensions of the jobs and of the machine are powers
of two and there are no large jobs. We make the assumption that the machine
is a cube only for clarity; it is easy to see that it is not necessary.

Two  jobs  are  defined  to  be  in  the  same  job  class  if  they  differ  only  in 
the  last  (smallest)  dimension.  As  before,  we  assign  to  each  class  a  volume 
proportional  to  its  work.  This  time  we  partition  the  whole  volume,  and 
prove  that  even  after  all  rounding  it  can  fit  into  a  constant  number  of  copies 
of  the  machine.  This  is  sufficient,  since  instead  of  scheduling  all  jobs  in 
parallel,  we  can  arrange  the  parts  of  schedule  performed  on  different  copies 
of  the  machine  sequentially.  We  can  do  this  even  in  the  case  of  the  on-line 
randomized  algorithm,  with  no  change  of  the  proofs. 

Now  we  describe  the  process  of  assigning  the  submeshes  and  packing  them 
into  the  desired  shape.  We  derive  the  estimates  on  the  additional  volume 
later. 

Instead of assigning a union of squares to a job class, we assign a union
of submeshes whose last dimension is $n$ and the first $d-1$ dimensions are
the same as for any job in the class. Rounding the volume up uses at most
one such submesh for each subclass; we have to prove that the total volume
of them is small.

We continue inductively as follows. We group all job classes with the
same $d-2$ first coordinates, and pack the submeshes corresponding to them
into submeshes with the last two dimensions $n$ and the first $d-2$ dimensions
the same as the jobs in those classes. Since only one dimension of the submeshes
that are being packed varies and it is always a power of two, all but one of the
larger submeshes of each size is completely full. We have to prove that the
total volume of those submeshes that are not full is small. We continue this
process with $d-3$ dimensions, etc., until we get meshes with all dimensions
$n$, which is our goal.

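Now we derive the estimates on the total volume. Let

$$F(i,j) = \sum_{l_1 \ge 1} \sum_{l_2 \ge l_1} \cdots \sum_{l_i \ge l_{i-1}} 2^{-l_1} 2^{-l_2} \cdots 2^{-l_{i-1}} 2^{-j l_i}.$$

Each subclass contains jobs of size $n2^{-l_1} \times \ldots \times n2^{-l_d}$ for some particular
sequence $1 \le l_1 \le \ldots \le l_d$, using the fact that large jobs are not allowed.
It is easy to see that the extra volume during the initial rounding is at most
$n^d F(d-1, 1)$, and the extra volume during grouping according to the first $i$
coordinates is at most $n^d F(i, 1)$. Hence the total volume of the final meshes
is at most $n^d (1 + \sum_{i=1}^{d-1} F(i, 1))$.

First we prove by induction on $i$ that $F(i,j) \le e^{2^{2-j}}\, 2^{1-i-j}$. We use the
fact that for any $l' \ge 1$,

$$\sum_{l \ge l'} 2^{-jl} = \frac{1}{1 - 2^{-j}}\, 2^{-jl'} \le (1 + 2^{1-j})\, 2^{-jl'} \le e^{2^{1-j}}\, 2^{-jl'}.$$

Hence

$$F(i,j) = \sum_{l_1 \ge 1} \sum_{l_2 \ge l_1} \cdots \sum_{l_i \ge l_{i-1}} 2^{-l_1} 2^{-l_2} \cdots 2^{-l_{i-1}} 2^{-j l_i}$$
$$\le e^{2^{1-j}} \sum_{l_1 \ge 1} \sum_{l_2 \ge l_1} \cdots \sum_{l_{i-1} \ge l_{i-2}} 2^{-l_1} 2^{-l_2} \cdots 2^{-l_{i-1}} 2^{-j l_{i-1}}$$
$$= e^{2^{1-j}} F(i-1, j+1) \le e^{2^{2-j}}\, 2^{1-i-j}.$$

Now $\sum_{i=1}^{d-1} F(i, 1) \le \sum_{i=1}^{d-1} e^2 2^{-i} \le e^2$ and hence we need at most $1 + e^2$
copies of the machine. This finishes the proof.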
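The bound on F can be checked numerically. The sketch assumes that F(i, j) is the sum over nondecreasing sequences 1 ≤ l_1 ≤ ... ≤ l_i of 2^(-l_1) ··· 2^(-l_(i-1)) · 2^(-j·l_i), and compares it with the candidate bound e^(2^(2-j)) · 2^(1-i-j); both the definition and the closed-form bound are reconstructions, and the cutoff is an arbitrary truncation of the infinite sums.

```python
import math
from functools import lru_cache


def F(i: int, j: int, cutoff: int = 60) -> float:
    """Truncated value of the nested sum F(i, j)."""

    @lru_cache(maxsize=None)
    def tail(k: int, lo: int) -> float:
        # Sum over the last k indices, each at least `lo`, nondecreasing.
        if k == 1:
            return sum(2.0 ** (-j * l) for l in range(lo, cutoff + 1))
        return sum(2.0 ** (-l) * tail(k - 1, l) for l in range(lo, cutoff + 1))

    return tail(i, 1)


def bound(i: int, j: int) -> float:
    return math.exp(2.0 ** (2 - j)) * 2.0 ** (1 - i - j)


print([round(F(i, 1), 4) for i in range(1, 6)])
print(sum(F(i, 1) for i in range(1, 6)), math.e ** 2)
```

The values F(i, 1) decay geometrically in i, so their sum stays far below e², consistent with a constant number of machine copies independent of d.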

12.3  Randomized  scheduling 

In  this  section  we  generalize  the  randomized  algorithm  SAMPLE  to  higher 
dimensions. 

We obtain a constant competitive ratio independent of $d$ if the dimensions
of all jobs and of the machine are powers of two, and there are no large jobs,
i.e., each dimension of a job is smaller than the corresponding dimension of
the mesh. Even with these restrictions, this result seems to be surprisingly
good, especially when compared to the deterministic algorithm, where the
dependence on $d$ is $O((2d\log d)^d)$.

The influence of these restrictions is analogous to off-line scheduling in
Section 12.2. If the machine is a cube, we can process the large jobs with a
loss of only a factor $d$ as follows. We partition them according to the number
$i$ of dimensions that are the same as the corresponding dimension of the mesh.
For each $i$ we process them as $(d-i)$-dimensional jobs, disregarding the first $i$
dimensions. If the machine is not a cube, the same process loses a factor up
to $2^d$. If the dimensions of the jobs and of the mesh are not powers of two,
the competitive ratio worsens by a factor of $4^d$.

In  the  rest  of  this  section  we  demonstrate  how  to  modify  the  algorithm 
SAMPLE  so  that  it  achieves  a  constant  competitive  ratio  independent  of 
d  assuming  that  the  machine  is  a  cube,  the  dimensions  of  the  jobs  and  of 
the  machine  are  powers  of  two  and  there  are  no  large  jobs.  We  make  the 
assumption  that  the  machine  is  a  cube  only  for  clarity,  it  is  not  necessary. 

We already know from Section 12.2 how to schedule the jobs if we know
the proportion of the work in each job class. The fact that we have only
estimates does not change anything in the algorithm and the proof. This allows
us to generalize the steps (5) to (7). Hence it only remains to generalize the
sampling step (4), in particular the partition of the mesh and the derivation
of the bounds on the probability of a wrong estimate.

Two jobs are defined to be in the same class if they differ only in the last
dimension and in the same subclass if all their dimensions are equal. For
$1 \le l_1 \le l_2 \le \cdots \le l_d$ the subclass $\mathcal{J}^{(l_1,\ldots,l_d)}$ contains all jobs of size
$m_{l_1} \times \cdots \times m_{l_d}$, where $m_l = n/2^l$. Let $l = l_1 + l_2 + \cdots + l_{d-1}$, $l'_1 = l_1$ and
$l'_i = l_i - l_{i-1}$ for $i > 1$. Note that $l_i \ge 1$ for all $i$, and $l \ge d-1$ since we do not
allow large jobs. To a subclass $\mathcal{J}^{(l_1,\ldots,l_d)}$ we assign a submesh of size
$m_{l'_1} \times \cdots \times m_{l'_{d-1}} \times [\ldots]$. The submesh assigned to $\mathcal{J}^{(l_1,\ldots,l_d)}$ is divided
into $[\ldots]$ submeshes of size $m_{l_1} \times \cdots \times m_{l_d}$, which means that we schedule
$[\ldots]$ jobs from $\mathcal{J}^{(l_1,\ldots,l_d)}$ in parallel.

It is easy to verify by induction that for every $i$ and $l_1, \ldots, l_i$ the
submeshes for all $l_{i+1}, \ldots, l_d$ fit into two meshes of size $[\ldots] \times n \times \cdots \times n$, and
hence all of the submeshes fit on two copies of the machine. We sample first
in parallel for $[\ldots]$ submeshes on one copy of the machine, then in parallel for
all submeshes on the other copy of the machine.

If  we  sample  for  a  sufficient  constant  multiple  of  the  estimate  of  the 

maximal  time,  the  probability  that  the  estimate  for  . is  wrong  is 

bounded  by 

la  , 

where a is a small positive constant which we choose later. Similarly to the
proof of Claim 11.13, the total error for each class is bounded by


a 


2l+l-d 


The  number  of  nondecreasing  sequences  of  positive  integers  of  length 
d  —  1  with  sum  /  is  at  most  (in  fact,  this  is  the  partition  number,  which 
is  known  to  be  however,  we  do  not  need  such  a  strong  bound  here), 

hence  the  number  of  different  classes  with  the  same  I  is  bounded  by 
Thus  the  total  probability  of  an  error  is  bounded  by 


1  y 

l>d-l 


2^2-o" 

i>l 


<  -1  ^  qJ  < 
}>i 


4a^ 

1  —  tt  ’ 


where  in  the  first  inequality  we  use  the  fact  that  a  sequence  consisting  of 

repeated  twice,  tv'  repeated  four  times . tv^'  repeated  2‘-times . is 

bounded by a geometric sequence. For a sufficiently small a this bound is
at most 1/2, and hence with probability at least 1/2 all of the estimates are correct.
The rest of the proof follows as for the two-dimensional meshes.

Part III

Scheduling  parallel  jobs  with 
dependencies 

13 Scheduling on PRAMs with virtualization

The most important result of this section is an optimal $(2 + \phi)$-competitive
algorithm for scheduling on PRAMs, where $\phi = (\sqrt{5} - 1)/2 \approx 0.618$ is the
golden ratio. We extend this result and give optimal bounds for the restricted
scheduling problem in which each job requests at most $\lambda N$ processors,
$0 < \lambda \le 1$. In both cases we give matching lower and upper bounds.

This  result  improves  the  best  known  approximation  ratio  of  3  for  the  same 
problem  achieved  by  Wang  and  Cheng  [WC92].  Our  algorithm  improves  their 
approximation  ratio  and  in  addition  is  on-line,  whereas  their  approximation 
algorithm  uses  the  information  about  the  running  times  for  scheduling. 

First we present our algorithm. The algorithm is greedy: whenever some
available job requests no more processors than are currently available, such
a job is scheduled. In addition, if the efficiency is lower than some constant
$\alpha$, some job is scheduled even if we have to use virtualization. The constant
$\alpha$ is a parameter of the algorithm which we will optimize later. It always
satisfies

$1/2 < \alpha < 1$.

Algorithm PRAM($\alpha$)

while not all jobs are finished do
    if some available job J requests p processors and p processors are available
        then schedule J on the p processors;
    else if the efficiency is less than $\alpha$ and some job is available
        then schedule a job on all available processors (using virtualization).
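
The decision rule of PRAM(alpha) can be illustrated by a minimal Python sketch of a single scheduling decision. Several details here are assumptions made only for illustration and are not fixed by the text: efficiency is taken to be the fraction of busy processors, the first suitable job in the list is chosen, and the names `Job` and `pram_step` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    requested: int  # number of processors the job asks for


def pram_step(available_jobs, busy, n_processors, alpha):
    """Return (job, processors) for the next job to start, or None.

    Assumption: efficiency = busy / n_processors.
    """
    free = n_processors - busy
    # Greedy rule: start some available job that fits at its requested size.
    for job in available_jobs:
        if job.requested <= free:
            return job, job.requested
    # Virtualization rule: if the efficiency is below alpha, squeeze a job
    # onto all currently free processors.
    efficiency = busy / n_processors
    if available_jobs and free > 0 and efficiency < alpha:
        return available_jobs[0], free
    return None
```

With 10 processors and alpha close to 0.618, a job requesting 9 processors is started on 5 free processors when 5 are busy (efficiency 0.5), but is kept waiting when 7 are busy (efficiency 0.7).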

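Theorem 13.1 Suppose that each job requests at most $\lambda N$ processors. Then
PRAM($\alpha$) is $\sigma$-competitive for $\sigma = 2 + \frac{\sqrt{4\lambda^2 + 1} - 1}{2\lambda}$ and $\alpha = \frac{1}{2} - \lambda + \frac{1}{2}\sqrt{4\lambda^2 + 1}$.

Remark. The above $\alpha$ and $\sigma$ satisfy $\sigma = 1 + \frac{1}{\alpha}$ and $\alpha^2 + (2\lambda - 1)\alpha - \lambda = 0$,
or equivalently $(1 - \alpha)(\frac{1}{\lambda} + \frac{1}{\alpha}) = 1$.

Corollary 13.2 If the number of requested processors is unrestricted ($\lambda = 1$),
then PRAM($\phi$) is $(2 + \phi)$-competitive, where $\phi = (\sqrt{5} - 1)/2$ is the golden
ratio.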
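
The identities in the remark are easy to check numerically. The following sketch assumes the closed forms alpha = 1/2 - lambda + sqrt(4 lambda^2 + 1)/2 and sigma = 2 + (sqrt(4 lambda^2 + 1) - 1)/(2 lambda), which are the solutions of the quadratic equation in the remark; the function name `pram_parameters` is hypothetical.

```python
import math


def pram_parameters(lam):
    """Return (alpha, sigma) for a job-size bound 0 < lam <= 1."""
    root = math.sqrt(4 * lam * lam + 1)
    alpha = 0.5 - lam + 0.5 * root
    sigma = 2 + (root - 1) / (2 * lam)
    return alpha, sigma


def check_identities(lam):
    """Verify the three relations stated in the remark for one value of lam."""
    alpha, sigma = pram_parameters(lam)
    return (
        math.isclose(sigma, 1 + 1 / alpha)
        and math.isclose(alpha**2 + (2 * lam - 1) * alpha - lam, 0.0, abs_tol=1e-12)
        and math.isclose((1 - alpha) * (1 / lam + 1 / alpha), 1.0)
    )
```

For lam = 1 the function returns alpha equal to the golden ratio (about 0.618) and sigma equal to 2 plus the golden ratio, in agreement with Corollary 13.2.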

Proof of Theorem 13.1. First, observe that at any time when some job
is available, the efficiency is at least $\alpha$, since otherwise another job would be
scheduled. Let $T$ be the total time during the schedule when the efficiency
is at least $\alpha$, $T'$ the time when the efficiency is between $1 - \alpha$ and $\alpha$, and
$T'' = T_{<(1-\alpha)}$ the remaining time. If some job is scheduled on fewer than the
requested number of processors, then it is scheduled on at least $(1 - \alpha)N$
processors. Therefore no job running during $T''$ is slowed down and any job
running during $T'$ is slowed down by at most a factor of $\frac{\lambda}{1-\alpha}$. During $T'$
and $T''$ there is no job available, hence by Lemma 7.2 we get
$\frac{1-\alpha}{\lambda} T' + T'' \le T_{\max} \le T_{\mathrm{opt}}$.
Because the efficiency of the optimal schedule cannot be greater
than one, $\alpha T + (1 - \alpha)T' \le T_{\mathrm{opt}}$. By adding the first inequality and the last
one divided by $\alpha$, we obtain
$T + (1 - \alpha)(\frac{1}{\lambda} + \frac{1}{\alpha})T' + T'' \le (1 + \frac{1}{\alpha})T_{\mathrm{opt}}$.
Since $(1 - \alpha)(\frac{1}{\lambda} + \frac{1}{\alpha}) = 1$, the length of the schedule is bounded by
$T + T' + T'' \le (1 + \frac{1}{\alpha})T_{\mathrm{opt}} = \sigma T_{\mathrm{opt}}$
and the algorithm is $\sigma$-competitive. □

Now  we  prove  a  matching  lower  bound. 

Theorem 13.3 Suppose that the largest number of processors requested by
a job is $\lambda N$, where $0 < \lambda \le 1$, and $N$ is the number of processors of the
machine. Then no on-line scheduling algorithm achieves a better competitive
ratio than $\sigma = 2 + \frac{\sqrt{4\lambda^2 + 1} - 1}{2\lambda}$ for all $N$.

Proof.  We  first  present  the  proof  for  fully  on-line  algorithms:  it  is  divided 
into  two  cases,  A  =  1  and  A  <  1.  Then  we  sketch  how  it  can  be  generalized 
to  all  on-line  algorithms. 

The case of $\lambda = 1$. The strategy of the adversary is to either keep the
efficiency of the machine under $\phi$, or force the algorithm to schedule a job
on a number of processors that is smaller than the requested number by a
factor of at least $2 + \phi$.

The job system used by the adversary is illustrated in Figure 6. It has
$l$ levels; each level has $\lfloor \phi N \rfloor + 2$ sequential jobs and one parallel job of size
$N$. The parallel job depends on one of the sequential jobs from the same
level; all sequential jobs depend on the parallel job from the previous level.
In addition there is one more sequential job dependent on the last parallel
job. (See Figure 6.)

Figure 6: The job system used in the proof of the lower bound for on-line
scheduling on PRAMs with virtualization. (The boxes denote the jobs, the
vertical dimension is their running time and the horizontal dimension is their
size.) Labels in the figure: $N$ processors; 1 processor; running time $\phi l T$;
running time 0; running time $T$.

Now we describe the adversary's strategy for assigning the running times.
The adversary can enforce that at each level the parallel job always depends
on the sequential job that was scheduled last. This is possible since by our
assumption the algorithm is fully on-line and hence cannot distinguish the
sequential jobs from each other.

If a parallel job is scheduled on fewer than $(1 - \phi)N$ processors, it is slowed
down by a factor greater than $1/(1 - \phi)$. In this case the adversary assigns
a sufficiently long time to this job and removes all other jobs. Therefore the
competitive ratio is at least $1/(1 - \phi) = 2 + \phi$ and we are done.

Otherwise the adversary assigns running time 0 to all the parallel jobs
and to all the sequential jobs on which the parallel jobs depend. The
running times of all other sequential jobs are set to $T$ for some fixed $T$,
with the exception of the last sequential job, which has running time
$T' = \phi l T$. Because of the dependencies and the assumption that no parallel job is
scheduled on fewer than $(1 - \phi)N$ processors, no parallel job can be scheduled
earlier than time $T$ after the previous parallel job has finished. Therefore the
schedule takes at least time $lT + T' = (1 + \phi) l T$.

The optimal schedule first schedules all jobs with time 0 and then the
sequential job with time $T'$ in parallel with all other sequential jobs. The time
needed to schedule all other sequential jobs in parallel on $N - 1$ processors
is $\lceil l(\lfloor \phi N \rfloor + 1)T/(N - 1) \rceil$, which is arbitrarily close to $\phi l T$ for sufficiently
large $N$ and $l$. Therefore the length of the optimal schedule is arbitrarily close to
$\phi l T$ and the competitive ratio is arbitrarily close to $(1 + \phi)/\phi = 2 + \phi$. This
finishes the proof for $\lambda = 1$.
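
As a numeric illustration of this limit, the sketch below compares the on-line bound (1 + phi) l T with the length of the off-line schedule described above, under the assumption that the off-line schedule runs the l (floor(phi N) + 1) jobs of length T in rounds on N - 1 processors next to the single job of length T' = phi l T. The function name and the default T = 1 are illustrative choices.

```python
import math

PHI = (math.sqrt(5) - 1) / 2


def lower_bound_ratio(n_processors, levels, t=1.0):
    """Ratio of the on-line bound to the off-line schedule length."""
    # On-line algorithm: time T per level plus the last job of length T'.
    online = levels * t + PHI * levels * t
    # Off-line schedule: jobs of length T packed in rounds on N - 1
    # processors, in parallel with the job of length T' = phi * l * T.
    long_jobs = levels * (math.floor(PHI * n_processors) + 1)
    rounds = math.ceil(long_jobs / (n_processors - 1))
    offline = max(PHI * levels * t, rounds * t)
    return online / offline
```

For example, `lower_bound_ratio(10**6, 10**4)` is within 0.001 of 2 + phi, and the ratio never exceeds 2 + phi because the off-line schedule always takes at least time T'.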

The case of $\lambda < 1$. The proof of this case is more complicated. The method
used in the previous case does not lead to the optimal result. However, if it
is iterated with $\phi$ replaced in the $i$-th iteration by a carefully chosen $\alpha_i$, then
the upper bound is matched in the limit.

The adversary uses a job system similar to the one in the previous case.
Given a sequence $\alpha_1, \alpha_2, \dots$, the job system has a phase for each $\alpha_i$. Phase
$i$ has $l$ levels, and each level has $\lfloor \alpha_i N \rfloor + 2$ sequential jobs and one parallel
job of size $\lfloor \lambda N \rfloor$. The dependencies are the same as in the previous case, i.e.,
each parallel job depends on one sequential job from the same level and all
sequential jobs depend on the parallel job from the previous level. The
adversary's strategy is also similar. The running times of all parallel jobs and all
sequential jobs on which the parallel jobs depend are set to 0; the running
times of all other sequential jobs are $T$, except for the jobs from the last
level. This scheme only changes if some parallel job is scheduled on too few
processors.

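Let $\sigma > 2$ be the lower bound on the competitive ratio that we want to
achieve. We choose $\alpha_1, \alpha_2, \dots$ so that if the parallel job of the $i$-th phase is
scheduled together with more than $\alpha_i N$ sequential jobs, it is slowed down too
much. Let $\lambda_i = \frac{1}{i}\sum_{j \le i} \alpha_j$. Let $\alpha_i$ be such that $\sigma = \frac{\lambda}{1-\alpha_1}$ and
$\sigma = \frac{1-\lambda}{\lambda_{i-1}} + \frac{\lambda}{1-\alpha_i}$
for $i > 1$. All $\alpha_i$ and $\lambda_i$ are between 0 and 1. Both sequences are decreasing
to the same limit $L$ satisfying $\sigma = \frac{1-\lambda}{L} + \frac{\lambda}{1-L}$.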

First suppose that no parallel job is scheduled earlier than time $T$ after
the previous parallel job has finished. Then the schedule takes time $T$ for
each level and the average efficiency of the first $i$ phases is at most $\lambda_i + \frac{1}{N}$.
After sufficiently many phases this is arbitrarily close to $L + \frac{1}{N}$. At that
point the adversary stops the process and assigns a nonzero time $T'$ to a
single sequential job and running time 0 to all the remaining jobs. The time
$T'$ is chosen so that the optimal schedule for all the previously scheduled jobs
takes time $T'$ on $N - 1$ processors. This forces the competitive ratio to be
arbitrarily close to $1 + \frac{1}{L}$.

Now suppose that some parallel job $J$ of phase $i$ is scheduled early. Then
$J$ is scheduled on at most $(1 - \alpha_i)N$ processors. We prove that the adversary
can achieve a competitive ratio arbitrarily close to $\sigma$. For $i = 1$, the job
$J$ is slowed down by a factor of $\frac{\lambda}{1-\alpha_1} = \sigma$, so the adversary just runs $J$
long enough. For $i > 1$, let $\tau$ be the time when $J$ is scheduled and let $A$
be the average efficiency of the schedule before $\tau$. Obviously $A \le \lambda_{i-1}$.
The adversary removes all jobs that have not been scheduled and sets the
running time of $J$ to $T' = \frac{A\tau}{1-\lambda}$. The optimal schedule first runs all jobs
with running time 0 and then all sequential jobs in parallel with the parallel
job with running time $T'$. The time $T'$ is chosen so that the length of the
optimal schedule is within an arbitrarily small factor of $T'$ for large $l$ and $N$.
The schedule generated by the on-line algorithm takes time
$\tau + \frac{\lambda}{1-\alpha_i} T' = (\frac{1-\lambda}{A} + \frac{\lambda}{1-\alpha_i}) T' \ge (\frac{1-\lambda}{\lambda_{i-1}} + \frac{\lambda}{1-\alpha_i}) T' = \sigma T'$.
So the competitive ratio is at least arbitrarily close to $\sigma$.

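We have shown that if we choose $\sigma$ such that $\sigma = 1 + \frac{1}{L}$, no on-line algorithm
has competitive ratio smaller than $\sigma$. We know that $\sigma = \frac{1-\lambda}{L} + \frac{\lambda}{1-L}$. The
substitution of $\sigma = 1 + \frac{1}{L}$ and a short calculation shows that the condition
for $L$ is equivalent to the equation for $\alpha$ in Theorem 13.1. Therefore
$\sigma = 2 + \frac{\sqrt{4\lambda^2 + 1} - 1}{2\lambda}$ is the solution of the equation.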
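
The convergence claimed in this case can be explored numerically. The sketch below rests on an assumed reading of the recurrences, since the formulas are damaged in the source: alpha_1 solves sigma = lambda/(1 - alpha_1), lambda_i is the average of alpha_1, ..., alpha_i, and for i > 1 alpha_i solves sigma = (1 - lambda)/lambda_{i-1} + lambda/(1 - alpha_i). Under that reading both sequences approach the alpha of Theorem 13.1. The function name `iterate_phases` is hypothetical.

```python
import math


def iterate_phases(lam, phases=2000):
    """Iterate the assumed recurrences; return (sigma, alphas, final average)."""
    root = math.sqrt(4 * lam * lam + 1)
    sigma = 2 + (root - 1) / (2 * lam)
    # Phase 1: sigma = lam / (1 - alpha_1).
    alphas = [1 - lam / sigma]
    total = alphas[0]
    for i in range(2, phases + 1):
        avg = total / (i - 1)  # lambda_{i-1}, the average of earlier alphas
        # Solve sigma = (1 - lam) / avg + lam / (1 - alpha_i) for alpha_i.
        alpha_i = 1 - lam / (sigma - (1 - lam) / avg)
        alphas.append(alpha_i)
        total += alpha_i
    return sigma, alphas, total / phases
```

For lam = 0.5 the values decrease from about 0.793 towards sqrt(2)/2, which is the alpha given by Theorem 13.1 for this lam; for lam = 1 the sequence is constant and equal to the golden ratio.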

General on-line algorithms. We need to modify the job system to handle
the case where the on-line algorithm knows the job system in advance. We
generate sufficiently many copies of each job, so that the graph is very symmetric
and the scheduling algorithm cannot take advantage of its additional
knowledge. More precisely, the new job system is a tree of the same depth
and each parallel job has the same fan-out as before. There is one parallel
job dependent on each sequential job except for the last level. So instead
of a constant-width tree we have a tree which is exponentially larger. The
adversary's strategy is the same except for the following modification. When
a sequential job is scheduled and it is not the last job of its level, then it and
all jobs in the subtree dependent on it are assigned time 0. Thus both the
resulting schedule and the optimal schedule have the same length as in the
fully on-line case, and the lower bound holds. □

14  Scheduling  on  meshes  and  hypercubes 
with  virtualization 

In this section we show that the optimal competitive ratio on one-dimensional
meshes is $\Theta(\frac{\log N}{\log\log N})$ if both dependencies and virtualization are allowed.

This competitive ratio can be achieved by a deterministic algorithm, and
our lower bound holds even for randomized algorithms. This proves that
in this case randomization does not help, as opposed to scheduling without
dependencies.

The gap between the constant competitive ratio for PRAMs and the
optimal competitive ratio for one-dimensional meshes is quite large.

This is in sharp contrast to scheduling with no dependencies, where we are
able to achieve a constant competitive ratio for one-dimensional meshes, only
slightly larger than for PRAMs. This shows that the influence of the network
topology is amplified by the presence of dependencies.

For higher-dimensional meshes the algorithm can be generalized and is
$O((\frac{\log N}{\log\log N})^d)$-competitive. For hypercubes a similar algorithm is
$O(\frac{\log N}{\log\log N})$-competitive. However, for these cases we do not have a
matching lower bound.

14.1  Algorithms 

First we present an algorithm for a $d$-dimensional hypercube. As in Section 9,
we say that a $d'$-dimensional subcube is normal if the coordinates of all its
processors are identical except for possibly the last $d'$ coordinates. Let $h$ be
the smallest power of two such that $h \log h > d$. Note that $h = O(\frac{d}{\log d})$. We
partition the jobs into $h$ job classes $\mathcal{J}_i$, $1 \le i \le h$. A job is in $\mathcal{J}_i$ if it requests
a hypercube whose dimension is between $(d + 1 - i \log h)$ and $(d - (i - 1) \log h)$.
The hypercube is partitioned into $h$ normal $(d - \log h)$-dimensional subcubes
$M_1, \dots, M_h$. Jobs from $\mathcal{J}_i$ are only scheduled on $M_i$.

Algorithm HYPERCUBE

while not all jobs are finished do
    for all $i$ do
        if there is a job in $\mathcal{J}_i$ available and a normal $(d - i \log h)$-dimensional
        subcube in $M_i$ is available
            then schedule that job on that subcube;

Theorem 14.1 The competitive ratio of HYPERCUBE for a $d$-dimensional
hypercube with $N = 2^d$ processors is $O(\frac{d}{\log d}) = O(\frac{\log N}{\log\log N})$.

Proof. The algorithm assigns a $(d - i \log h)$-dimensional subcube to each
job from $\mathcal{J}_i$. First, this implies that if there is any job available in $\mathcal{J}_i$, the
efficiency of the subcube $M_i$ is equal to 1, and the overall efficiency is at least
$1/h$. Second, no job is slowed down by a factor greater than $h$ and hence
from Lemma 7.2 it follows that $T_{<1/h} \le h T_{\max}$. Hence, by Lemma 7.1, the
competitive ratio is $O(h) = O(\frac{d}{\log d}) = O(\frac{\log N}{\log\log N})$. □
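
The parameters of HYPERCUBE can be computed directly. The sketch below assumes the strict inequality h log h > d as printed, logarithms to base 2, and clamps the assigned subcube dimension at 0 for the smallest jobs; the clamp and all function names are assumptions made for illustration.

```python
def class_parameters(d):
    """Return (h, log2(h)) for the smallest power of two h with h*log2(h) > d."""
    e = 0
    while (1 << e) * e <= d:
        e += 1
    return 1 << e, e


def job_class(d, job_dim):
    """Index i of the class whose range d+1-i*log(h) .. d-(i-1)*log(h) holds job_dim."""
    _, log_h = class_parameters(d)
    return (d - job_dim) // log_h + 1


def assigned_dim(d, job_dim):
    """Dimension of the subcube given to the job (clamped at 0, an assumption)."""
    _, log_h = class_parameters(d)
    return max(0, d - job_class(d, job_dim) * log_h)
```

For d = 10 this gives h = 8 and log h = 3, so jobs of dimension 8 to 10 fall into class 1 and run on 7-dimensional subcubes, which slows them down by a factor of at most 2^3 = h.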

A similar algorithm can be used for the $d$-dimensional mesh of $N = n^d$
processors. Let $k$ be the smallest integer such that $k^k > n$. Note that
$k = O(\frac{\log n}{\log\log n})$. The jobs are partitioned into $h = k^d$ job classes $\mathcal{J}_i$,
$i = (i_1, \dots, i_d)$, $1 \le i_j \le k$. A job belongs to $\mathcal{J}_i$ if it requests a submesh of size
$(a_1, a_2, \dots, a_d)$ such that $n/k^{i_j} < a_j \le n/k^{i_j - 1}$. The mesh is partitioned into
$h$ submeshes $M_i$ of size $\lfloor n/k \rfloor \times \dots \times \lfloor n/k \rfloor$. The jobs from $\mathcal{J}_i$ are scheduled
on submesh $M_i$ only.

Algorithm MESH

while not all jobs are finished do
    for all $i = (i_1, \dots, i_d)$ do
        if there is a job in $\mathcal{J}_i$ available and a $\lceil n/k^{i_1} \rceil \times \dots \times \lceil n/k^{i_d} \rceil$ submesh
        in $M_i$ is available
            then schedule that job on such a submesh with the smallest coordinates;

The proof of the following theorem is similar to that of Theorem 14.1 and is
omitted.

Theorem 14.2 MESH is $O((\frac{\log N}{\log\log N})^d)$-competitive for a $d$-dimensional
mesh of $N = n^d$ processors.
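
For the mesh algorithm, the parameter k and the class index of a single side length can be computed as follows. This is a sketch assuming the strict inequality k^k > n and the half-open ranges n/k^i < a <= n/k^(i-1); the function names are hypothetical.

```python
def smallest_k(n):
    """Smallest integer k with k**k > n."""
    k = 1
    while k**k <= n:
        k += 1
    return k


def side_class(n, a):
    """Index i with n / k**i < a <= n / k**(i - 1), for a side length 1 <= a <= n."""
    k = smallest_k(n)
    i = 1
    # While a <= n / k**i, the side length belongs to a later class.
    while a * k**i <= n:
        i += 1
    return i
```

For n = 16 this gives k = 3; a side of length 16 is in class 1, a side of length 5 in class 2 and a side of length 1 in class 3. A d-dimensional job is then classified by applying `side_class` to each of its d side lengths.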

14.2  Lower  bound 

In this section we prove that not even a randomized algorithm for one-dimensional
meshes can achieve a better competitive ratio than $\Theta(\frac{\log N}{\log\log N})$.

Hence our deterministic algorithm for one-dimensional meshes from Section 14
is within a constant factor of the optimal competitive ratio and even
randomization cannot improve it. This is in contrast to scheduling without
dependencies on two-dimensional meshes studied in Section 11, where we
have shown that randomization can help significantly.

Our approach to this lower bound is similar to the method used in
Section 11.3 to prove a lower bound for deterministic scheduling on
two-dimensional meshes without dependencies. The adversary again tries to block
a large fraction of processors by jobs that only use a small fraction of all
processors. Here we can prove a larger lower bound, since dependencies give the
adversary more control over the size of available jobs.

However, it is considerably more difficult to prove a lower bound for
randomized algorithms than for deterministic algorithms. For deterministic
algorithms, the adversary can simulate the scheduling algorithm and hence
its actions can depend on the actions of the scheduler. In contrast, for
randomized algorithms, we have to specify the running times, or at least their
distribution, in advance, since the adversary has no access to the random
bits of the algorithm. This significantly restricts the adversary as opposed
to our proof of the lower bound in Section 11.3, where it was crucial to use
the possibility of setting the running times according to the actions of the
algorithm.

From  a  technical  point  of  view  it  is  interesting  that  the  dependencies  give 
to  the  adversary  sufficient  power  to  handle  randomization.  The  lower  bound 
for  deterministic  scheduling  on  one-dimensional  meshes  with  dependencies 
in  [FKST93]  uses  a  very  similar  technique  to  the  bound  from  Section  11.3  for 
scheduling  without  dependencies,  yet  the  first  one  generalizes  to  randomized 
algorithms  and  the  other  one  does  not. 

Another technical difficulty is that the on-line algorithm can
use virtualization. Virtualization caused no problem in the deterministic
case, since we could argue that if any job is scheduled on a small number of
processors, we just assign it a long running time. This argument no longer
works, since we have to commit to the distribution of running times beforehand.
Thus arguing about a single job is not sufficient. We have to argue
that if the algorithm schedules many jobs using virtualization, with high
probability one of them has a long running time under the distribution of
running times that we choose. To make this argument work, we have to set the
parameters in our proof very carefully.

The key technical tool that makes the lower bound work for randomized
algorithms is Lemma 14.3. It says that if the algorithm schedules many jobs
at once, and we choose some small fraction of them at random, most likely
that fraction will block a relatively large number of processors.

We  give  the  proof  only  for  fully  on-line  algorithms.  It  can  be  generalized 
to  all  on-line  algorithms  by  the  method  used  in  the  proof  of  Theorem  13.3. 

The  job  system  and  the  off-line  schedule 

Let  D  be  the  smallest  integer  such  that  >  iV;  let  T  —  D*  and  5  =  T^. 
We  assume  without  loss  of  generality  that  N  —  and  N  is  sufficiently 
large. Note that $D = \Theta(\frac{\log N}{\log\log N})$.

The job system used in the proof is illustrated in Figure 7. It is a tree of
depth $D^2$ similar to the job system used in the proof of Theorem 13.3. All jobs
on level $iD + j$ of the tree, for $0 \le i, j < D$, request $S^j$ processors; there are
$TN/(D^2 S^j) + 1$ of them. We assign the running times at random; thus, strictly
speaking, we have not a single instance of the problem but a distribution on
some subset of instances. For each level the running time is 0 for a single
randomly chosen job, and for the other jobs it is $T$ with probability $1/T$ and
1 otherwise. All jobs on a given level depend on the job with running time 0
from the previous level; there are no other dependencies.

We  call  the  jobs  with  running  time  0  critical. 

For each level, the total number of processors requested by non-critical
jobs is $TN/D^2$. Thus the work of the jobs on each level is at least $TN/D^2$
and the expected work on each level is less than $2TN/D^2$.
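
The bound on the expected work follows from the expected running time of a non-critical job, which is T (1/T) + 1 (1 - 1/T) = 2 - 1/T. The sketch below checks this with exact rational arithmetic. It assumes that a level with job size S^j holds TN/(D^2 S^j) non-critical jobs; the numeric parameters in the usage note are arbitrary illustrative values, not the values chosen in the proof.

```python
from fractions import Fraction


def expected_running_time(t):
    """Running time is t with probability 1/t and 1 otherwise."""
    return Fraction(1, t) * t + (1 - Fraction(1, t)) * 1


def expected_level_work(t, n, d, s, j):
    """Expected work of the non-critical jobs on a level with job size s**j."""
    jobs = Fraction(t * n, d * d * s**j)  # number of non-critical jobs
    return jobs * s**j * expected_running_time(t)
```

With t = 16, n = 4096, d = 4, s = 2 and j = 3, the level requests TN/D^2 = 4096 processors in total and its expected work is 7936, which lies between TN/D^2 and 2TN/D^2.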

First we bound the length of the optimal schedule. The optimal schedule
first schedules the chain of critical jobs; this takes no time as their running
time is 0. The remaining jobs are independent of each other, hence we can
schedule them in time $O(\max(T_{\mathrm{eff}}, T_{\max}))$ using the results on scheduling
without dependencies from Section 10. Obviously $T_{\max} = T$. The total
expected work is less than $2TN$ and thus the expected length of the optimal
schedule is $O(T)$.

Figure 7: A typical instance of a job system used in the proof of the lower
bound for randomized scheduling on one-dimensional meshes. (The boxes denote
the jobs, the vertical dimension is their running time and the horizontal
dimension is their size.) Labels in the figure: critical jobs; non-critical jobs;
$TN/D^2$ processors requested on each level; $D^2$ levels.

An overview of the proof

In the rest of the proof we show that the expected length of the schedule
generated by any randomized on-line algorithm is at least $\Omega(DT)$, where the
expectation is taken over both the random bits of the algorithm and all the
instances of the job system as chosen randomly by the adversary. This is
sufficient to conclude that the competitive ratio is at least $\Omega(D)$.

It  is  crucial  that  the  adversary  gives  the  jobs  on  each  level  in  random 
order.  Since  they  are  indistinguishable  to  the  on-line  algorithm,  no  matter 
what  the  algorithm  does,  they  will  be  scheduled  in  a  random  order.  In 
particular  the  expected  fraction  of  the  non-critical  jobs  from  the  given  level 
scheduled  before  the  critical  one  is  1/2. 
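
The value 1/2 can be confirmed by a short exact computation. The sketch assumes that the critical job is equally likely to occupy each of the m + 1 positions in the scheduling order of a level with m non-critical jobs; the function name is hypothetical.

```python
from fractions import Fraction


def expected_fraction_before_critical(m):
    """Expected fraction of the m non-critical jobs scheduled before the critical one."""
    # If the critical job is in position p (counting from 0), exactly p
    # non-critical jobs precede it; all m + 1 positions are equally likely.
    return sum(Fraction(p, m) for p in range(m + 1)) / (m + 1)
```

The result is exactly 1/2 for every m >= 1.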

The on-line algorithm has to schedule the critical job on the given level
before it can schedule any jobs on the next level. If the non-critical jobs
scheduled before the critical job on the given level are scheduled on some
contiguous segment of the mesh, we expect that on most segments $T$
times larger than the size of a job there will be at least one job with running
time $T$. If this is true, then such a segment cannot be used for time $T$
for scheduling jobs at least $S$ times longer. Lemma 14.3 proves a similar
statement for a general case in which the jobs are not necessarily scheduled
on one contiguous segment of the mesh.

Therefore for each $D$ consecutive levels with increasing size of jobs the
space cannot be reused efficiently. A constant fraction of levels uses only a
small fraction of the machine, namely $O(N/D)$ processors. Since the position
of the critical job is random, the expected work of the jobs scheduled
before the critical job is $\Omega(TN/D^2)$ on any of these levels. For $N$ sufficiently
large the fraction of these jobs that can be scheduled in parallel is negligible.
Therefore, if they are scheduled on $O(N/D)$ processors, the expected
time until the critical job is scheduled is at least $\Omega(T/D)$. Thus the on-line
algorithm needs expected time $\Omega(DT)$ for all levels.

These arguments get more complicated if the algorithm uses virtualization.
We can no longer completely avoid scheduling jobs in a space which is
intuitively unusable, since with virtualization it is always possible to schedule
a large job on an arbitrarily small segment. However, if this happens too
often, with large probability one of the jobs that use virtualization has a long
running time. Our parameters are carefully chosen so that this argument is
sufficient;  in  fact,  this  is  the  main  reason  why  5  is  as  large  as  without 
virtualization  a  smaller  power  would  be  sufficient. 

Simplifying  assumptions  about  the  on-line  algorithm 

We first introduce more terminology and make some assumptions about the
behavior of the algorithm. These assumptions are possible, since we can
modify any on-line algorithm so that it satisfies them and the competitive
ratio is smaller or only slightly larger.

We divide the schedule into phases and subphases as follows. The subphase
$j$ of the phase $i$, $0 \le i, j < D$, is the time interval during which the
first level with an unscheduled critical job is level $iD + j$ (for convenience we
number the levels, phases and subphases from 0).

We  use  the  phrase  just  before  the  current  subphase  to  refer  to  the  begin¬ 
ning  of  the  subphase  if  it  is  the  subphase  0  of  any  phase,  or  to  the  end  of  the 
previous  subphase  otherwise. 

We assume that during subphase $j$ of phase $i$ only the jobs from level
$iD + j$ of the job system are scheduled. This assumption is possible without
loss of generality, since once the critical job is scheduled, the on-line
algorithm knows that no other jobs depend on the remaining jobs of this
level. Thus these remaining jobs from all levels can be scheduled together at
the end of the schedule by the constant-competitive algorithm for scheduling
without dependencies. This changes the competitive ratio only by an
additive constant.

Since the jobs on each level are ordered randomly by the adversary, no
matter what the actions of the randomized algorithm are, the jobs from each
level are scheduled in a random order. Since we average over all instances of
the job system generated by the adversary, it is equivalent to assume that
the running times of the jobs are assigned as follows. At the beginning of
each subphase we decide randomly the position in which the critical job on
that level will be scheduled. Then whenever a job is scheduled, if it has the
correct position, it has running time 0. Otherwise we randomly assign its
running time to be $T$ with probability $1/T$ or 1 otherwise. We make the
random decision about the running time of a non-critical job only at the
moment when that job would finish if its running time were 1 or at the end
of the subphase. (Note that the actual running time can be increased by
virtualization.) Since the algorithm is on-line, it does not know the running
times of jobs until they finish, and thus this delay of the decision does not
change its actions. All random choices that we make are independent. Of
course, the algorithm can make some additional random choices.

A non-critical job is called undecided if we have not yet assigned its running
time.

We make two assumptions related to virtualization. Both of these
assumptions and their use later are very relaxed. We could tighten them by
giving precise constants in places where we use asymptotic notation; however,
the improvement in the final result obtained by that would be very small.

First, we assume that no non-critical job is scheduled on at most an $o(1/DT)$
fraction of the number of processors it requests. Otherwise its expected
running time is at least $\omega(DT)$. In that case the on-line algorithm could just
use the deterministic $O(D)$-competitive algorithm for the remaining jobs.
The schedule would finish in time $O(DT)$, and hence the performance of the
on-line algorithm would improve.

Second, we assume that it never happens that there are $T$ undecided
jobs each running on at most an $o(1/D)$ fraction of the number of processors
requested by it. Otherwise the probability that none of these jobs will have
running time $T$ is $(1 - 1/T)^T \le 1/e$, since their running times are chosen
independently. Hence with at least a constant probability one of these jobs
will have running time $T$, and since it is scheduled on an $o(1/D)$ fraction of the
requested processors, the length of the schedule is at least $\omega(DT)$. Thus the
on-line algorithm again performs better if it uses the deterministic algorithm
for the remaining jobs.
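
The inequality (1 - 1/T)^T <= 1/e used in the second assumption can be checked numerically for a range of values of T. This is only a sanity check in floating-point arithmetic, not a proof; the function name is hypothetical.

```python
import math


def prob_no_long_job(t):
    """Probability that none of t independent jobs gets running time t."""
    return (1 - 1 / t) ** t
```

The value increases with t and approaches 1/e (about 0.3679) from below, so the probability that at least one of the T jobs is long is always more than 1 - 1/e.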

The  measure  of  progress 

We say that a processor is used if it is assigned to a job with running time T
scheduled during the current phase. During subphase j of any phase we
say that a processor is blocked if the length of the largest segment of unused
processors containing it is at most $D^2 T S_j$ (a used processor is considered
blocked). The blocked space is defined to be the number of blocked processors.
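
As an illustration of this definition, here is a minimal sketch that counts the blocked space on a one-dimensional mesh. The representation (a list of booleans) and the names `blocked_space`, `used` and `threshold` are assumptions made for the example; `threshold` stands for the bound on the segment length from the definition above.

```python
from typing import List

def blocked_space(used: List[bool], threshold: int) -> int:
    """Count blocked processors on a one-dimensional mesh.

    used[i] is True if processor i is used. A processor is blocked if it
    is used, or if the maximal run of unused processors containing it has
    length at most `threshold`.
    """
    blocked = 0
    i, n = 0, len(used)
    while i < n:
        if used[i]:
            blocked += 1          # a used processor is considered blocked
            i += 1
            continue
        j = i
        while j < n and not used[j]:
            j += 1                # extend the maximal unused run [i, j)
        run = j - i
        if run <= threshold:
            blocked += run        # the whole short run is blocked
        i = j
    return blocked

mesh = [False, False, True, False, False, False, False, True, False]
print(blocked_space(mesh, threshold=2))
```

In this example the unused run of length 4 in the middle stays unblocked, while the short runs at both ends of the mesh count as blocked.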

From the definition of the blocked space and our assumptions it follows
that in subphase j of any phase no job of size $S_j$ is scheduled in the space
that was blocked just before the current subphase, as long as the length of the
current phase is at most T. This is trivially true if j = 0, since then “just
before the current subphase” refers to the beginning of the phase, and no
space is blocked at that time. For j > 0, “just before the current subphase”
refers to the end of the previous subphase. Thus any job scheduled in a
space that was blocked at that time would be scheduled between two jobs of
running time T that were scheduled during the current phase, and thus are
still running. By the definition of blocked space, the space between those two
jobs is at most $D^2 T S_{j-1} = o(S_j/DT)$. Scheduling a job in such
a small space would violate our first assumption about virtualization.

Since  the  algorithm  is  randomized,  we  cannot  ensure  that  the  adversary 
always  blocks  some  fixed  amount  of  space,  as  opposed  to  the  deterministic 
proof  in  [FKST93].  Instead,  we  measure  the  expected  sum  of  the  blocked 
space  and  the  length  of  the  schedule.  This  measure  is  very  much  like  a 
potential  function  in  amortized  analysis.  From  the  overview  of  the  proof  it 
follows that a $1/D$ fraction of the space should have approximately the same
weight as a time interval of length $T/D$.

Formally, we define the waste at a given time to be the sum of the blocked
space divided by $N/D$ plus the current length of the schedule divided by
$T/100D$. We measure the waste in units: one unit of waste corresponds to
$N/D$ of the blocked space or to time $T/100D$ in the length of the schedule.
The increase of the waste is the difference between the current waste and the
waste just before the current subphase.
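
The definition of the waste can be written out as a small function. This is only a sketch; the function name `waste` and the concrete parameter values are hypothetical and chosen so that all quantities are integers.

```python
from fractions import Fraction

def waste(blocked: int, schedule_length: int, N: int, D: int, T: int) -> Fraction:
    """Waste in units: blocked space / (N/D) + schedule length / (T/(100 D))."""
    space_units = Fraction(blocked * D, N)                # blocked / (N/D)
    time_units = Fraction(schedule_length * 100 * D, T)   # length / (T/(100 D))
    return space_units + time_units

# Hypothetical values: N = 1000 processors, D = 10, T = 2000.
# 200 blocked processors are 2 units; a schedule of length T/(50 D) = 4
# contributes another 2 units.
print(waste(blocked=200, schedule_length=4, N=1000, D=10, T=2000))
```

The second term shows why a subphase of length $T/50D$ accounts for 2 units of waste, which is the threshold used in the proof below.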

Note that the waste at the beginning of a subphase and at the end of
the previous subphase can be different, since blocked space is defined in
the context of the current subphase. The waste can decrease only at the
beginning of any phase or after time T of any phase.

Estimating the progress

The following lemma says that if many undecided jobs are running at once,
the expected increase of the waste is high. Once we know that, the rest of
the proof follows easily.

Lemma 14.3 Suppose that the time since the beginning of the current phase
is less than T. If at least $23N/D$ processors are assigned to undecided jobs,
then the expected increase of the waste at the end of the current subphase is
at least 2 units.

Proof. We prove that the expected number of processors that were not
blocked just before the current subphase but are blocked after the running
times of the undecided jobs are assigned is at least $2N/D$. This is sufficient,
since then by the definition the expected increase of waste is at least 2 units.

Divide the processors that were not blocked just before the current
subphase into segments of length at most $2DTS_j$ so that no segment is shorter
than $DTS_j$ unless it is adjacent to a blocked processor or the end of the
mesh on both ends. (This is possible since any segment of at least $DTS_j$
processors can be divided into segments of size between $DTS_j$ and $2DTS_j$.)
Mark each of the segments with at least $DTS_j$ processors if at least a $1/D$
fraction of its processors is assigned to undecided jobs.

Every marked interval contains at least $TS_j$ processors assigned to
undecided jobs, hence it intersects at least T undecided jobs. The probability that
none of these jobs will be assigned running time T is at most $(1 - 1/T)^T < 1/e$.
Therefore, for any two marked intervals with at most $D^2 T S_j/2$ processors
between them, the probability that the segment between them will be blocked
is at least $(1 - 1/e)^2 > 1/3$, as the two events are independent. (There is
a possibility that one of the undecided jobs is intersected by both marked
intervals, in which case the events are not quite independent. However, then
both marked intervals intersect T undecided jobs distinct from the common
one, and the bound is still true.)

Now we show that there are many marked intervals. It then follows
that a constant fraction of them lies between two marked intervals that are
sufficiently close, and hence on average a constant fraction of marked intervals
will be blocked.

The segments shorter than $DTS_j$ are all blocked and were not blocked
just before the current subphase. Hence if their total length is at least $2N/D$,
we are done; otherwise they contain at most $2N/D$ processors assigned to
undecided jobs. The unmarked segments with at least $DTS_j$ processors
contain a total of at most $N/D$ processors assigned to undecided jobs.
Therefore at least $20N/D$ of the processors assigned to the undecided jobs are in
the marked segments, and the number of the marked segments is at least
$(20N/D)/2DTS_j = 10N/D^2 T S_j$.


Let the envelope of a marked segment be the largest segment of the mesh
containing it which does not intersect any other marked segment and does not
contain any processor that was blocked at the end of the previous subphase.
Each processor is contained in at most two envelopes, hence the sum of sizes
of all envelopes is at most 2N. It follows that there are at least $6N/D^2 T S_j$
marked segments with envelope of size at most $D^2 T S_j/2$, since otherwise
the sum of the sizes of the envelopes of the remaining more than $4N/D^2 T S_j$
marked segments is more than 2N. Each of these segments with a small
envelope is adjacent at both ends to a marked interval, blocked processor or
the end of the mesh, hence it will be blocked with probability at least 1/3.
Thus the total expected length of the marked intervals that will be blocked
is at least

$$\frac{1}{3} \cdot \frac{6N}{D^2 T S_j} \cdot DTS_j = \frac{2N}{D}.$$


By the definition of marked segments this area was not blocked at the end
of the previous phase. □
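
The counting in the proof can be replayed with exact arithmetic. The sketch below is an illustration only: the function name `lemma_counts` and the sample values of N, D, T and the job size S (standing for the job size of the current subphase) are hypothetical.

```python
from fractions import Fraction
import math

def lemma_counts(N: int, D: int, T: int, S: int):
    """Replay the counting argument of the proof for concrete numbers."""
    # 23N/D processors run undecided jobs; at most 2N/D of them lie in
    # short segments and at most N/D in unmarked segments.
    marked_procs = Fraction(23 * N, D) - Fraction(2 * N, D) - Fraction(N, D)
    # Each marked segment has at most 2*D*T*S processors.
    num_marked = marked_procs / (2 * D * T * S)
    # Envelopes sum to at most 2N, so at most this many can exceed D^2*T*S/2.
    big_envelopes = Fraction(2 * N) / Fraction(D * D * T * S, 2)
    small_envelopes = num_marked - big_envelopes
    # Each such segment has at least D*T*S processors and is blocked
    # with probability at least 1/3.
    expected_blocked = Fraction(1, 3) * small_envelopes * (D * T * S)
    return marked_procs, num_marked, small_envelopes, expected_blocked

N, D, T, S = 10**6, 4, 5, 3
marked, num, small, blocked = lemma_counts(N, D, T, S)
assert marked == Fraction(20 * N, D)
assert blocked == Fraction(2 * N, D)      # 2 units of waste
assert (1 - 1 / math.e) ** 2 > 1 / 3      # probability bound used above
print(marked, num, small, blocked)
```

The constants 23, 2, 1 and 1/3 are taken from the proof; only the sample parameters are made up.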


Now  we  are  ready  to  prove  the  main  theorem  of  this  section. 


Theorem 14.4 No randomized on-line scheduling algorithm can achieve a
better competitive ratio than  for a one-dimensional mesh of N
processors.

Proof. First we prove that the expected increase of the waste at the end of each
subphase is at least 2 units as long as the length of the current phase is at
most T.

If at any time during this subphase the number of processors assigned to
undecided jobs is more than $23N/D$, the expected increase of the waste is
high by Lemma 14.3. If this is not true at any point of the subphase, we
prove that the expected length of the subphase is at least $T/50D$, and hence
the increase of the waste is at least 2 units.

From our second assumption about virtualization it follows that at any
time at most $23DN/S_j$ undecided jobs are scheduled on any $23N/D$
processors. By the assumption no T undecided jobs are scheduled on an $o(1/D)$
fraction of the number of processors they request, hence those $23DN/S_j$
undecided jobs would have to be scheduled on at least
$\Omega((23DN/S_j - T)S_j/D) = \Omega(N)$ processors.

The expected number of non-critical jobs scheduled before the critical
one is $TN/2D^2 S_j$. By the previous argument only $23DN/S_j$ of them can be
assigned running time at the end of the subphase. Thus the expected total
work done by all jobs on the given level while they are undecided is at least
$TN/2D^2 - 23DN \ge (1 - o(1))TN/2D^2$. Since undecided jobs are scheduled
on at most $23N/D$ processors, for a sufficiently large N it takes expected time
$(1 - o(1))T/46D > T/50D$ to perform the required work, which concludes
the proof that the expected increase of the waste during the subphase is at
least 2 units.

Thus if the length of some phase is at most T, the total expected increase
of the waste during all D subphases of that phase is at least 2D. The total
amount of the blocked space is at most N, which corresponds to D units
of the waste. Hence at least D units of the expected increase of the waste
are contributed by the length of the schedule, and therefore the expected
length of the phase is $\Omega(T)$. It follows that the expected length of
the schedule is $\Omega(DT)$.

This expectation is taken not only over the random bits of the algorithm,
but also over the random instances produced by the adversary; therefore
we still have to argue that this proves that the competitive ratio is at least
$\Omega(D)$. Suppose that the competitive ratio is $o(D)$. Then for each instance
the expected length of the on-line schedule is at most $o(D)$ times the length
of the optimal schedule. Averaging over all instances we get that the
expected length of the on-line schedule is $o(DT)$, since the average length of
the optimal schedule is $O(T)$. This contradiction finishes the proof. □


15  Scheduling  without  virtualization 

In  this  section  we  demonstrate  the  importance  of  virtualization  for  on-line 
scheduling with dependencies. We prove that if a job is allowed to require the
whole machine, no deterministic on-line algorithm can achieve a better
competitive ratio than N, and randomization improves the competitive ratio by
at most a factor of two.

We  can  achieve  a  matching  upper  bound  in  the  deterministic  case  by 
scheduling  one  job  at  a  time,  without  even  attempting  to  schedule  the  jobs 
in  parallel.  This  demonstrates  that  virtualization  is  essential  in  the  design  of 
competitive  scheduling  algorithms  in  our  model  of  on-line  scheduling.  It  also 
shows  yet  another  difference  between  on-line  scheduling  with  dependencies 
and  scheduling  without  dependencies.  We  have  seen  in  Part  II  that  without 
dependencies  neither  the  size  of  the  largest  job  nor  virtualization  changes  the 
competitive  ratios  dramatically. 

By  the  same  method  we  prove  a  lower  bound  on  the  competitive  ratio  of 
a  deterministic  on-line  algorithm  if  the  number  of  processors  required  by  a 
job  is  restricted  to  be  at  most  some  constant  fraction  of  the  machine.  We 
prove  a  matching  upper  bound  for  PRAMs.  This  again  gives  us  a  tradeoff 
between the size of the largest job and the competitive ratio. Compared
to the analogous tradeoff with virtualization allowed in Section 13, here the
optimal competitive ratio is significantly higher and it is more dependent on
the maximal size of the jobs.

We only give the proof for fully on-line algorithms. It can be generalized
to all on-line algorithms by the method used in the proof of Theorem 13.3.

Theorem 15.1 Assuming that virtualization is not allowed, the following
holds.

(i) No deterministic on-line scheduling algorithm can achieve a smaller
competitive ratio than N on any machine with N processors.

(ii) No randomized on-line scheduling algorithm can achieve a smaller
competitive ratio than N/2 on any machine with N processors.

(iii) Suppose that the largest number of processors requested by a job on
a machine with N processors is $\lambda N$, for a fixed $0 < \lambda < 1$. Then no
deterministic on-line scheduling algorithm can achieve a smaller competitive ratio
than $1 + \frac{1}{1-\lambda}$.

Proof.  We  prove  the  last  part  of  the  theorem  first,  since  the  other  two  parts 
then  follow  easily. 

(iii) The proof is similar to that of Theorem 13.3. The job system used
by the adversary has $N - 1$ levels. Each level has $N - \lfloor \lambda N \rfloor + 2$ sequential
jobs and one parallel job requiring $\lfloor \lambda N \rfloor$ processors. The parallel job depends
on one of the sequential jobs from the same level; all sequential jobs depend
on the parallel job from the previous level. In addition there is one more
sequential job dependent on the last parallel job.

In the beginning the algorithm can schedule only sequential jobs. The
adversary enforces that on each level the sequential job which the parallel
job depends on is started last; this is possible since the algorithm cannot
distinguish between the sequential jobs. The adversary terminates this
sequential job and keeps the other sequential jobs running for some sufficiently
large time T. During this time the scheduling algorithm cannot schedule the
parallel job. As soon as the parallel job is scheduled, the adversary
terminates it and all remaining jobs of this level. This process is repeated until
all jobs except the last sequential job have been scheduled. The adversary
assigns time $T' = (N - \lfloor \lambda N \rfloor + 1)T$ to the last job. The total length of the
generated schedule is $(N - 1)T + T' = (2N - \lfloor \lambda N \rfloor)T$.

The off-line algorithm first schedules all jobs of time 0; then schedules
the sequential job with running time T' and in parallel with it all the other
sequential jobs. There are $(N - 1)(N - \lfloor \lambda N \rfloor + 1)$ such jobs, all with running
time at most T and independent of each other. The schedule for them on
$N - 1$ processors takes time at most $(N - \lfloor \lambda N \rfloor + 1)T = T'$. So the length
of the off-line schedule is at most T' and the competitive ratio is at least

$$\frac{2N - \lfloor \lambda N \rfloor}{N - \lfloor \lambda N \rfloor + 1} = 1 + \frac{N - 1}{N - \lfloor \lambda N \rfloor + 1}.$$

This is arbitrarily close to $1 + \frac{1}{1-\lambda}$ for large N and constant $\lambda$.

(i) The proof is essentially a special case of the proof of (iii) for $\lambda = 1$.
Each level of the job system has two sequential jobs and one parallel job
requiring the whole machine. All sequential jobs with no dependent jobs
have running time T. As before, the adversary enforces that these jobs are
scheduled sequentially, while the optimal schedule runs all N of them in
parallel. This gives the competitive ratio N. (In this case we can use a
simpler job system which works directly even for the algorithms that are not
fully on-line, see [FKST93].)

(ii) We use the same job system as in the proof of (i). As the two
sequential jobs are not distinguishable, the adversary can guarantee that
with probability at least 1/2 the randomized on-line algorithm schedules the
job with nonzero running time first. In that case the parallel job from that
level cannot be scheduled until this sequential job finishes. Thus the expected
length of the schedule produced by the on-line algorithm is at least half of
the total running time of jobs with nonzero running time, which gives the
competitive ratio of N/2. □
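
The ratio obtained in part (iii) can be evaluated exactly for concrete machine sizes. The following is a sketch under the accounting of the proof above (on-line length $(2N - \lfloor \lambda N \rfloor)T$, off-line length $(N - \lfloor \lambda N \rfloor + 1)T$); the function name `lower_bound_ratio` is hypothetical.

```python
from fractions import Fraction
from math import floor

def lower_bound_ratio(N: int, lam: Fraction) -> Fraction:
    """Ratio of on-line to off-line schedule length in the adversary argument."""
    k = floor(lam * N)        # processors requested by each parallel job
    online = 2 * N - k        # on-line length, in units of T
    offline = N - k + 1       # off-line length T', in units of T
    return Fraction(online, offline)

lam = Fraction(1, 2)
for N in (10, 100, 10_000, 1_000_000):
    r = lower_bound_ratio(N, lam)
    print(N, r, float(r))     # approaches 1 + 1/(1 - lam) = 3

# With lam = 1 (a job may request the whole machine) the ratio is exactly N.
print(lower_bound_ratio(16, Fraction(1)))
```

For $\lambda = 1/2$ the printed values increase towards 3, and the case $\lambda = 1$ reproduces the bound N of part (i).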

Now  we  show  that  this  lower  bound  is  tight  for  PRAM.  In  fact,  any  greedy 
strategy  achieves  this  bound. 

Algorithm GREEDY

while not all jobs are finished do
    if some job J requires p processors and p processors are available,
    then schedule J on the p processors.
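
A greedy strategy of this kind can be simulated for a PRAM, where only the number of free processors matters. The sketch below is one possible reading of GREEDY, not a definitive implementation: the function name `greedy_makespan`, the dictionary-based job description and the event loop are assumptions made for the example. The running times are used only to simulate when jobs finish; the scheduling decisions never look at them.

```python
import heapq
from typing import Dict, List, Tuple

def greedy_makespan(N: int,
                    jobs: Dict[str, Tuple[int, float]],
                    deps: Dict[str, List[str]]) -> float:
    """jobs[name] = (processors, running_time); deps[name] = predecessors."""
    unmet = {j: len(deps.get(j, [])) for j in jobs}
    succ: Dict[str, List[str]] = {j: [] for j in jobs}
    for j, preds in deps.items():
        for p in preds:
            succ[p].append(j)
    available = [j for j in jobs if unmet[j] == 0]
    running: List[Tuple[float, str]] = []     # heap of (finish_time, job)
    free, now = N, 0.0
    while available or running:
        # Greedy step: start every available job that currently fits.
        waiting = []
        for j in available:
            p, t = jobs[j]
            if p <= free:
                free -= p
                heapq.heappush(running, (now + t, j))
            else:
                waiting.append(j)
        available = waiting
        if not running:
            raise ValueError("a job requests more than N processors")
        # Advance to the next completion and release its processors.
        now, done = heapq.heappop(running)
        free += jobs[done][0]
        for s in succ[done]:
            unmet[s] -= 1
            if unmet[s] == 0:
                available.append(s)
    return now

jobs = {"a": (1, 1), "b": (1, 3), "P": (3, 2)}
deps = {"P": ["a"]}
print(greedy_makespan(4, jobs, deps))
```

In the example the parallel job P becomes available at time 1 and fits next to the still running job b, so the schedule has length 3.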

Theorem 15.2 Suppose that the largest number of processors requested by
a job is $\lambda N$, where $0 < \lambda < 1$. Then GREEDY is $(1 + \frac{1}{1-\lambda})$-competitive.

Proof. No job requests more than $\lfloor \lambda N \rfloor$ processors. Therefore if the
efficiency is less than $1 - \lambda$, there is no available job. By Lemma 7.2 this time
is smaller than the total time along some path in the dependency graph and
hence $T_{<(1-\lambda)} \le T_{\max} \le T_{\mathrm{opt}}$. Lemma 7.1 finishes the proof. □

16 Tree dependency graphs

In  this  section  we  prove  Theorem  4.1,  which  says  that  scheduling  job  systems 
whose  dependency  graphs  are  trees  is  as  difficult  as  scheduling  general  job 
systems. 

First note that a similar theorem is easy to prove if we restrict ourselves
to fully on-line algorithms. Suppose that we have a fully on-line algorithm for
tree dependency graphs and a general dependency graph $\mathcal{F}$. We dynamically
construct a tree subgraph $\mathcal{F}'$ of $\mathcal{F}$ and use the algorithm on $\mathcal{F}'$. We can do
this because the fully on-line algorithm does not know the dependencies in
advance. For each job J we only keep the edge from a job J' such that J
became available when J' finished. The generated schedule is a valid schedule
for both $\mathcal{F}$ and $\mathcal{F}'$, and the optimal schedule for $\mathcal{F}$ can only improve if some
dependencies are removed. Therefore the achieved competitive ratio for $\mathcal{F}$ is
no greater than the ratio for the tree dependency graph $\mathcal{F}'$.
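
The edge selection for the fully on-line case can be sketched as follows. This is a simplification for illustration: the finishing times are passed in as a complete table, whereas the construction above observes them dynamically, and the names `tree_subgraph`, `preds` and `finish` are hypothetical.

```python
from typing import Dict, List, Optional

def tree_subgraph(preds: Dict[str, List[str]],
                  finish: Dict[str, float]) -> Dict[str, Optional[str]]:
    """For each job keep only the edge from the predecessor that finished
    last, i.e. the one whose completion made the job available."""
    kept: Dict[str, Optional[str]] = {}
    for job, ps in preds.items():
        kept[job] = max(ps, key=lambda p: finish[p]) if ps else None
    return kept

preds = {"a": [], "b": [], "c": ["a", "b"], "d": ["c", "a"]}
finish = {"a": 1.0, "b": 4.0, "c": 5.0, "d": 6.0}
print(tree_subgraph(preds, finish))
```

Every job keeps at most one incoming edge, so the result is a forest; ties between predecessors finishing at the same time are broken arbitrarily here.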

In the next theorem we prove the same statement for the more difficult
case when the on-line algorithm may know the dependency graph in
advance.

Theorem 16.1 Let an on-line scheduling problem with dependencies be
given (i.e., a specific architecture and simulation factors). Then the optimal
competitive ratio for this problem is equal to the optimal competitive ratio for
a restricted problem in which we allow only job systems whose dependency
graphs are trees as inputs.

Proof. Obviously the optimal competitive ratio for the restricted problem is
at most the ratio for the general problem. To prove the reverse inequality,
assume that we have a $\sigma$-competitive algorithm S for the restricted
problem. Using it, we show how to schedule a general job system so that the
competitive ratio is at most $\sigma$.

Given a general dependency graph $\mathcal{F}$, we create a job system with a tree
dependency graph $\mathcal{F}'$. Then we use the schedule for $\mathcal{F}'$ produced by S to
schedule $\mathcal{F}$. We determine the running times of jobs in $\mathcal{F}'$ dynamically based
on the running times of jobs in $\mathcal{F}$.

The set of jobs of $\mathcal{F}'$ is the set of all directed paths in $\mathcal{F}$ starting with
any job that has no predecessor. There is a directed edge (p, q) in $\mathcal{F}'$ if p is
a prefix of q. If $J \in \mathcal{F}$ is the last node of $p \in \mathcal{F}'$, then p is called a copy of
J. The resource requirements of each copy of J are the same as those of J.
A path p is the last copy of J if it is the last copy of J to be scheduled. Let
$\mathcal{F}''$ be the subgraph of $\mathcal{F}'$ consisting of all last copies and all dependencies
between them.
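
The path construction can be sketched in a few lines. The code below is an illustration under stated assumptions: the dependency graph is given as a successor map in which every job appears as a key, the names `path_tree` and `copies` are hypothetical, and only the edge from the immediate prefix is stored (the remaining prefix edges follow by transitivity).

```python
from typing import Dict, List, Optional, Tuple

Path = Tuple[str, ...]

def path_tree(succ: Dict[str, List[str]]) -> Dict[Path, Optional[Path]]:
    """Map every directed path starting at a job without predecessors
    to its parent path (the path with the last node removed)."""
    has_pred = {k for targets in succ.values() for k in targets}
    roots = [j for j in succ if j not in has_pred]
    parent: Dict[Path, Optional[Path]] = {(r,): None for r in roots}
    stack: List[Path] = list(parent)
    while stack:
        p = stack.pop()
        for k in succ.get(p[-1], []):
            q = p + (k,)          # extend the path by one dependency edge
            parent[q] = p
            stack.append(q)
    return parent

def copies(tree: Dict[Path, Optional[Path]], job: str) -> List[Path]:
    """All copies of a job: the paths that end in it."""
    return [p for p in tree if p[-1] == job]

# A diamond: d depends on b and c, which both depend on a.
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
tree = path_tree(succ)
print(len(tree), sorted(copies(tree, "d")))
```

The job d has two copies here, one for each path from a. Chaining k such diamonds doubles the number of copies of the final job each time, which is why the tree can be exponentially larger than the original graph.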

Our scheduling algorithm works as follows. We run S on $\mathcal{F}'$. Suppose S
schedules $p \in \mathcal{F}'$. If p is the last copy of some J, then we schedule J on the
same set of processors as p was scheduled by S. If p is not the last copy, then
we remove p and all jobs that depend on it. If a job $J \in \mathcal{F}$ finishes, we
stop its last copy $p \in \mathcal{F}'$.

Notice that if p is the last copy of J, then we schedule J at the same
time, on the same set of processors and with the same running time as S
schedules p. All other copies of J are immediately stopped.

To show correctness of our schedule, we need to prove that when the last
copy of J is available to S, J is available to us. Suppose this is not the case.
Then there is some $J' \in \mathcal{F}$ such that J depends on J' and J' has not finished
yet. Then the last copy of J', say $q \in \mathcal{F}'$, is also not finished, and there is
a copy p of J such that q is a prefix of p. So p is a copy of J that is not
available yet, a contradiction.

The schedule that S generates for $\mathcal{F}'$ and our schedule for $\mathcal{F}$ have the same
length. Only the jobs from $\mathcal{F}''$ (i.e., the last copies of the jobs from $\mathcal{F}$) are
relevant in $\mathcal{F}'$; all other jobs have running time 0 and can be scheduled at the
end. By construction, every schedule for $\mathcal{F}$ corresponds to a schedule for $\mathcal{F}''$
and therefore to a schedule of $\mathcal{F}'$. So the competitive ratio for $\mathcal{F}$ is not larger
than the competitive ratio for $\mathcal{F}'$, which is at most $\sigma$ by the assumption of
the theorem. □

Algorithmically, the above reduction from general constraints to trees is
not completely satisfactory, because $\mathcal{F}'$ can be exponentially larger than $\mathcal{F}$.
Nevertheless, it proves an important property of on-line scheduling from the
viewpoint of competitive analysis.

Part IV

Randomized scheduling of sequential jobs

17  The  model  and  the  previous  results 

The problem of on-line scheduling of sequential jobs was introduced in 1966
by Graham [Gra66]. In this model we have m processors and a sequence
of jobs arriving one by one; there are no dependencies between the jobs.
The jobs are sequential, i.e., they require only one processor. When a job
arrives, we know its running time and we have to assign it to one of the
processors immediately, without knowledge of the jobs that arrive later. The
jobs cannot be preempted.

As  in  our  model  for  parallel  jobs,  the  goal  is  to  minimize  the  makespan, 
the  time  when  the  last  job  is  completed.  In  the  randomized  model  we  measure 
the  expected  makespan.  The  performance  of  an  on-line  algorithm  is  again 
measured  by  the  competitive  ratio. 

In Section 18 we improve the previously known lower bounds on the
competitive ratio of randomized on-line algorithms for this problem. This result
was proved independently by Chen, van Vliet and Woeginger [CvVW94a].
Now we survey the previous results.

While the deterministic case has been studied extensively [Gra66, GW93,
BFKV92, KPT94], much less is known about the randomized case; all of the
following results are by Bartal, Fiat, Karloff, and Vohra [BFKV92]. Only the
case of m = 2 is solved completely; an optimal 4/3-competitive algorithm is
known and provably better than any deterministic algorithm. For m = 3 a
nontrivial lower bound of 1.4 was proved. For m > 3, the best known lower
bound was the easy 4/3 bound, not even matching the bound for m = 3. For
m > 3 no randomized algorithm with a better competitive ratio than the
best deterministic algorithm is known.

For a long time the best deterministic algorithm was the $(2 - \frac{1}{m})$-competitive
Graham’s algorithm [Gra66]. For m = 2 and m = 3, Graham’s
algorithm is provably optimal. For larger m, it was improved several times
during the last few years [GW93, BFKV92, KPT94]. For sufficiently large m
the best known algorithm is 1.945-competitive [KPT94]. For m > 3, an
algorithm slightly better than Graham’s is presented in [GW93]. The best lower
bound on the competitive ratio of the deterministic scheduling algorithm for
large m is approximately 1.837 [BKR]; an earlier bound of $1 + 1/\sqrt{2} \approx 1.707$
is valid for any m > 3 [FKT89].

For a related model, deterministic preemptive on-line scheduling of
sequential jobs, Chen, van Vliet and Woeginger [CvVW94b] proved the same
lower  bound  as  we  prove  for  randomized  nonpreemptive  algorithms,  and  gave 
a  matching  algorithm. 

18  An  improved  lower  bound 

We prove that the competitive ratio of any randomized on-line scheduling
algorithm for m machines is at least $1 + 1/((\frac{m}{m-1})^m - 1)$. This matches the
known tight bound of 4/3 for m = 2 and improves the previous bounds for
all m > 2. If m is large, this bound approaches $1 + 1/(e - 1) \approx 1.582$. Table 8
compares the bounds for a few values.


number of processors    our bound                    previous bound
2                       4/3 ≈ 1.333                  4/3
3                       27/19 ≈ 1.421                1.4
4                       256/175 ≈ 1.463              4/3
5                       3125/2101 ≈ 1.487            4/3
6                       46656/31031 ≈ 1.504          4/3
∞                       1 + 1/(e - 1) ≈ 1.582        4/3
Table 8: Lower bounds on the competitive ratio for randomized on-line
scheduling of sequential jobs.

The intuition for our lower bound can be better understood if we look
at the algorithm and the lower bound for two processors from [BFKV92].
Their algorithm guarantees that the ratio of the expected loads on the two
processors is 2 : 1 most of the time (more exactly, unless there is a job
with a very long running time), where the load of a processor is defined
as the total running time of all jobs assigned to the processor, also called
height in [BFKV92]. Their lower bound essentially shows that any optimal
algorithm has to maintain this ratio of expected loads, and that it is not
possible to maintain a better ratio. The optimal algorithm can balance the
loads exactly, hence the competitive ratio is 2/1.5 = 4/3.

In the case of m processors we show that the best ratio of loads which
might be possible to maintain is $1 : x : x^2 : \dots : x^{m-1}$, where $x = m/(m - 1)$.
The optimal schedule can balance the loads to be
$(1 + x + \dots + x^{m-1})/m = (x^m - 1)/(m(x - 1))$, hence the competitive ratio is
at least $x^{m-1} / ((x^m - 1)/(m(x - 1)))$. Since $m(x - 1) = x$, this equals
$x^m/(x^m - 1) = 1 + 1/(x^m - 1)$, which gives our bound. See Figures 8 and 9
for an illustration.
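
The bound can be recomputed directly from the ratio of loads, using exact rational arithmetic. This is a small sketch; the function name `randomized_lower_bound` is hypothetical.

```python
from fractions import Fraction

def randomized_lower_bound(m: int) -> Fraction:
    """Largest load divided by the balanced load for loads 1 : x : ... : x^(m-1)."""
    x = Fraction(m, m - 1)
    loads = [x ** k for k in range(m)]     # 1, x, ..., x^(m-1)
    balanced = sum(loads) / m              # makespan of a perfectly balanced schedule
    return max(loads) / balanced

for m in range(2, 7):
    b = randomized_lower_bound(m)
    # Agrees with the closed form 1 + 1/((m/(m-1))^m - 1).
    assert b == 1 + 1 / (Fraction(m, m - 1) ** m - 1)
    print(m, b, round(float(b), 3))
```

The printed fractions are the entries of the "our bound" column of Table 8 for m = 2, ..., 6.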

We  first  prove  a  general  lemma  which  applies  to  any  sequence  of  jobs. 
For  our  proof  the  last  m  jobs  of  the  sequence  are  most  important.  In  fact  in 
the  particular  sequence  that  we  use  later,  the  only  purpose  of  the  other  jobs 
is  to  pad  the  sequence  so  that  the  optimal  schedule  can  always  balance  the 
loads  exactly. 

Let a sequence of jobs $\mathcal{J}$ be given. Denote the last m jobs of $\mathcal{J}$ by
$J_1, \dots, J_m$ and their running times by $t_1, \dots, t_m$. Let $\mathcal{J}_i$ be the initial segment
of $\mathcal{J}$ ending with $J_i$ (i.e., $\mathcal{J}_m = \mathcal{J}$). Let $S_i$ be the sum of the running times
of all jobs in $\mathcal{J}_i$ and let $T_{\mathrm{opt}}(\mathcal{J}_i)$ be the length of an optimal schedule for $\mathcal{J}_i$.
For a given randomized algorithm A, $E[T_A(\mathcal{J}_i)]$ denotes the expected length
of the schedule it generates on $\mathcal{J}_i$.

The definition of the competitive ratio $\sigma$ implies that for any choice of
the sequences of jobs $\mathcal{J}_i$ we have

$$\sigma \ge \frac{\sum_{i=1}^m E[T_A(\mathcal{J}_i)]}{\sum_{i=1}^m T_{\mathrm{opt}}(\mathcal{J}_i)}.$$

The following lemma gives a lower bound on the sum of the expected
makespans in this inequality.

Figure 8: An optimal on-line schedule for the sequence of jobs used in
the lower bound for randomized on-line scheduling of sequential jobs.

Figure 9: An optimal off-line schedule for the sequence of jobs used in
the lower bound for randomized on-line scheduling of sequential jobs.

Lemma 18.1 For any randomized on-line scheduling algorithm A for m
machines, $\sum_{i=1}^m E[T_A(\mathcal{J}_i)] \ge S_m$.

Proof. Fix a sequence of random bits used by the algorithm A. Let $T_i$ be
the makespan of the schedule generated by the algorithm A for the jobs in $\mathcal{J}_i$
with the fixed random bits. Since the algorithm is on-line, the schedule for $\mathcal{J}_i$
is obtained from the schedule for $\mathcal{J}_{i-1}$ by adding $J_i$ to one of the processors.

Order the processors so that the load of the ith processor does not change
after $J_i$ is scheduled. There always exists such an ordering: Designate a
processor to be the ith one, if $J_i$ is the last job scheduled on it. Clearly at
most one processor is designated as the ith one. If no processor is designated
as the jth one for some j, pick one of the remaining processors arbitrarily;
note that none of the jobs $J_1, \dots, J_m$ is scheduled on these processors. This
defines an ordering satisfying the condition.

Let $L_i$ be the load of the ith processor after scheduling all jobs. Obviously
$\sum_{i=1}^m L_i = S_m$. The load of the ith processor is $L_i$ already after scheduling $J_i$
because of our condition on the order of the processors. Therefore $T_i \ge L_i$ and
$\sum_{i=1}^m T_i \ge \sum_{i=1}^m L_i = S_m$ for any choice of the random bits.

The  lemma  follows  by  the  linearity  of  expectation.  □ 
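
The inequality for fixed random bits can be observed experimentally. The sketch below uses a deliberately naive on-line algorithm (each job goes to a uniformly random processor) purely as a stand-in; the function name `prefix_makespans` and the sample job sequence are hypothetical, and the sequence is assumed to contain at least m jobs.

```python
import random
from typing import List, Sequence

def prefix_makespans(times: Sequence[int], m: int, rng: random.Random) -> List[int]:
    """Assign jobs one by one to a random processor and return the
    makespans T_1, ..., T_m observed after each of the last m jobs."""
    loads = [0] * m
    result = []
    for idx, t in enumerate(times):
        loads[rng.randrange(m)] += t      # irrevocable on-line assignment
        if idx >= len(times) - m:
            result.append(max(loads))     # makespan of the current prefix
    return result

times = [1, 1, 1, 3, 6]                   # total running time S_m = 12
m = 2
for seed in range(100):
    T = prefix_makespans(times, m, random.Random(seed))
    # For every fixed choice of random bits: T_1 + ... + T_m >= S_m.
    assert sum(T) >= sum(times)
print("inequality holds for all tested seeds")
```

Since the inequality holds for every fixed seed, it also holds in expectation, which is the step taken by linearity of expectation.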

Let $x = m/(m - 1)$. We choose our sequence of jobs so that the following
property is satisfied. If all the jobs preceding $J_i$ are scheduled so that the
ratio of loads is $1 : x : \dots : x^{m-1}$, then after scheduling $J_i$ on the least loaded
processor the ratio of the loads is preserved.

Let $T = (m - 1)^{2m-1}$. The sequence $\mathcal{J}$ consists of $(1 + x + \dots + x^{m-1})T$ jobs
of running time 1 followed by m jobs with running times $t_i = (x^m - 1)x^{i-1}T$.
The value of T is chosen so that all running times are integers. Optimal
on-line and off-line schedules are illustrated in Figures 8 and 9.
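
The job sequence can be generated and checked mechanically. This is a sketch under the assumption that $T = (m - 1)^{2m-1}$, which is one value making all running times integers; the function names `lower_bound_sequence` and `ratio_preserving_loads` are hypothetical.

```python
from fractions import Fraction
from typing import List, Tuple

def lower_bound_sequence(m: int) -> Tuple[int, List[int]]:
    """Return (number of unit jobs, running times t_1..t_m of the long jobs)."""
    x = Fraction(m, m - 1)
    T = (m - 1) ** (2 * m - 1)
    unit_jobs = sum(x ** k for k in range(m)) * T
    long_jobs = [(x ** m - 1) * x ** (i - 1) * T for i in range(1, m + 1)]
    # All quantities must be integers for the construction to make sense.
    assert unit_jobs.denominator == 1
    assert all(t.denominator == 1 for t in long_jobs)
    return int(unit_jobs), [int(t) for t in long_jobs]

def ratio_preserving_loads(m: int) -> List[List[Fraction]]:
    """Put each long job on the least loaded processor, starting from
    loads T, xT, ..., x^(m-1)T, and record the sorted loads after each step."""
    x = Fraction(m, m - 1)
    T = (m - 1) ** (2 * m - 1)
    _, long_jobs = lower_bound_sequence(m)
    loads = [x ** k * T for k in range(m)]
    history = []
    for t in long_jobs:
        loads.sort()
        loads[0] += t
        history.append(sorted(loads))
    return history

print(lower_bound_sequence(2))   # 3 unit jobs, then jobs of length 3 and 6
print(lower_bound_sequence(3))
```

For m = 2 the loads develop as 1 : 2, then 2 : 4, then 4 : 8, so the ratio 1 : x is preserved after each long job.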

After scheduling the jobs preceding $J_i$ so that the ratio is as described
above, the loads of the processors are $x^{i-1}T, \dots, x^{i+m-2}T$. For i = 1 this is
obvious. It is maintained inductively, since by our choice of running times we
have $x^{i-1}T + t_i = x^{i+m-1}T$, which is exactly the condition we need to preserve
the ratio of the loads. The next theorem says that no on-line algorithm can
do better, even if it is randomized.

Theorem 18.2 For any randomized on-line scheduling algorithm for m
machines, the competitive ratio $\sigma_m$ is at least $1 + 1/((\frac{m}{m-1})^m - 1)$.
Proof. First we prove that $T_{\mathrm{opt}}(\mathcal{J}_i) = t_i$. The total running time of all jobs
in $\mathcal{J}_i$ is

$$\begin{aligned}
S_i &= (1 + \dots + x^{m-1})T + t_1 + \dots + t_i \\
    &= (x^i + \dots + x^{m-1})T + x^m(1 + \dots + x^{i-1})T \\
    &= x^i(1 + \dots + x^{m-1})T.
\end{aligned}$$

Since running times of all jobs are integers, at most m of them are greater
than 1 and $t_i$ is the maximal running time, the loads of the m processors can
be balanced perfectly and $T_{\mathrm{opt}}(\mathcal{J}_i) = S_i/m = t_i$.

Using Lemma 18.1 and $S_m = x^m(1 + \dots + x^{m-1})T$ from the previous
computation, the competitive ratio $\sigma_m$ for any randomized on-line algorithm
A is bounded by

$$\sigma_m \ge \frac{\sum_{i=1}^m E[T_A(\mathcal{J}_i)]}{\sum_{i=1}^m T_{\mathrm{opt}}(\mathcal{J}_i)}
\ge \frac{S_m}{\sum_{i=1}^m t_i}
= \frac{x^m(1 + \dots + x^{m-1})T}{(1 + \dots + x^{m-1})(x^m - 1)T}
= \frac{x^m}{x^m - 1} = 1 + \frac{1}{x^m - 1}.$$

□
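
The identities used in the proof can be verified with exact arithmetic for small m. The sketch assumes $T = (m - 1)^{2m-1}$ and running times $t_i = (x^m - 1)x^{i-1}T$; the function name `check_theorem` is hypothetical. It checks only the arithmetic, not the existence of a perfectly balanced schedule, which is argued in the proof.

```python
from fractions import Fraction

def check_theorem(m: int) -> Fraction:
    """Verify S_i = x^i (1 + ... + x^(m-1)) T and S_i / m = t_i for all i,
    and return S_m / (t_1 + ... + t_m)."""
    x = Fraction(m, m - 1)
    T = (m - 1) ** (2 * m - 1)
    geo = sum(x ** k for k in range(m))            # 1 + x + ... + x^(m-1)
    t = [(x ** m - 1) * x ** (i - 1) * T for i in range(1, m + 1)]
    S = geo * T                                    # total time of the unit jobs
    for i in range(1, m + 1):
        S += t[i - 1]                              # S now equals S_i
        assert S == x ** i * geo * T
        assert S / m == t[i - 1]                   # balanced load equals t_i
    return S / sum(t)

for m in range(2, 8):
    x = Fraction(m, m - 1)
    assert check_theorem(m) == 1 + 1 / (x ** m - 1)
print(check_theorem(2), check_theorem(3))
```

The returned value is the right-hand side of the chain of inequalities above, so the checks confirm the final simplification to $1 + 1/(x^m - 1)$.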

We conjecture that there exists a randomized algorithm which matches
this bound, but we are unable to prove this at the present time.